Digitized by the Internet Archive 
in 2023 with funding from 
University of Toronto 


https://archive.org/details/31/61103745436 




Survey 
Methodology 


Catalogue No. 12-001-XPB 


A journal




How to obtain more information 


For information about this product or the wide range of services and data available from Statistics Canada, visit our website at 
www.statcan.gc.ca, e-mail us at infostats@statcan.gc.ca, or telephone us, Monday to Friday from 8:30 a.m. to 4:30 p.m., at the
following numbers: 


Statistics Canada’s National Contact Centre 
Toll-free telephone (Canada and United States): 


Inquiries line 1-800-263-1136 
National telecommunications device for the hearing impaired 1-800-363-7629 
Fax line 1-877-287-4369 


Local or international calls: 
Inquiries line 1-613-951-8116 
Fax line 1-613-951-0581 


Depository Services Program 
Inquiries line 1-800-635-7943 
Fax line 1-800-565-7757 


To access and order this product 


This product, Catalogue no. 12-001-X, is available free in electronic format. To obtain a single issue, visit our website at
www.statcan.gc.ca and browse by “Key resource” > “Publications.” 


This product is also available as a standard printed publication at a price of CAN$30.00 per issue and CAN$58.00 for a one-year 
subscription. 


The following additional shipping charges apply for delivery outside Canada: 


Single issue Annual subscription 


United States CAN$6.00 CAN$12.00 
Other countries CAN$10.00 CAN$20.00 


All prices exclude sales taxes. 
The printed version of this publication can be ordered as follows: 


• Telephone (Canada and United States) 1-800-267-6677
• Fax (Canada and United States) 1-877-287-4369
• E-mail infostats@statcan.gc.ca
• Mail Statistics Canada
  Finance
  R.H. Coats Bldg., 6th Floor
  150 Tunney's Pasture Driveway
  Ottawa, Ontario K1A 0T6
• In person from authorized agents and bookstores.


When notifying us of a change in your address, please provide both old and new addresses. 


Standards of service to the public 


Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, Statistics Canada 
has developed standards of service that its employees observe. To obtain a copy of these service standards, please contact 
Statistics Canada toll-free at 1-800-263-1136. The service standards are also published on www.statcan.gc.ca under “About us” > 
“The agency” > “Providing services to Canadians.” 


Survey 
Methodology 




A journal 
published by 
Statistics Canada 


June 2012 • Volume 38 • Number 1


Published by authority of the Minister responsible for Statistics Canada 
© Minister of Industry, 2012 
All rights reserved. This product cannot be reproduced and/or transmitted to any person or 
organization outside of the licensee’s organization. Reasonable rights of use of the content of this product are 
granted solely for personal, corporate or public policy research, or for educational purposes. 

This permission includes the use of the content in analyses and the reporting of results and conclusions, 
including the citation of limited amounts of supporting data extracted from this product. These materials are solely 
for non-commercial purposes. In such cases, the source of the data must be acknowledged as follows: 
Source (or “Adapted from”, if appropriate): Statistics Canada, year of publication, name of product, catalogue number, 
volume and issue numbers, reference period and page(s). Otherwise, users shall seek prior written permission 
of Licensing Services, Information Management Division, Statistics Canada, Ottawa, Ontario, Canada K1A 0T6.
June 2012 
Catalogue no. 12-001-XPB 
Frequency: Semi-annual 
ISSN 0714-0045 


Ottawa 


Statistics Canada    Statistique Canada


SURVEY METHODOLOGY 
A Journal Published by Statistics Canada 


Survey Methodology is indexed in The ISI Web of knowledge (Web of science), The Survey Statistician, Statistical Theory and 
Methods Abstracts and SRM Database of Social Research Methodology, Erasmus University and is referenced in the Current Index 
to Statistics, and Journal Contents in Qualitative Methods. It is also covered by SCOPUS in the Elsevier Bibliographic Databases. 


MANAGEMENT BOARD

Chairman        J. Kovar
Past Chairmen   D. Royce (2006-2009)
                G.J. Brackstone (1986-2005)
                R. Platek (1975-1986)
Members         G. Beaudoin
                S. Fortier (Production Manager)
                J. Gambino
                M.A. Hidiroglou
                H. Mantel


EDITORIAL BOARD

Editor          M.A. Hidiroglou, Statistics Canada
Deputy Editor   H. Mantel, Statistics Canada
Past Editor     J. Kovar (2006-2009)
                M.P. Singh (1975-2005)


Associate Editors

J.-F. Beaumont, Statistics Canada
J. van den Brakel, Statistics Netherlands
J.M. Brick, Westat Inc.
P. Cantwell, U.S. Bureau of the Census
R. Chambers, Centre for Statistical and Survey Methodology
J.L. Eltinge, U.S. Bureau of Labor Statistics
W.A. Fuller, Iowa State University
J. Gambino, Statistics Canada
D. Haziza, Université de Montréal
B. Hulliger, University of Applied Sciences Northwestern Switzerland
D. Judkins, Westat Inc.
D. Kasprzyk, NORC at the University of Chicago
P. Kott, National Agricultural Statistics Service
P. Lahiri, JPSM, University of Maryland
P. Lavallée, Statistics Canada
P. Lynn, University of Essex
D.J. Malec, National Center for Health Statistics
G. Nathan, Hebrew University
J. Opsomer, Colorado State University
D. Pfeffermann, Hebrew University
N.G.N. Prasad, University of Alberta
J.N.K. Rao, Carleton University
J. Reiter, Duke University
L.-P. Rivest, Université Laval
N. Schenker, National Center for Health Statistics
F.J. Scheuren, National Opinion Research Center
P. do N. Silva, Escola Nacional de Ciências Estatísticas
P. Smith, Office for National Statistics
E. Stasny, Ohio State University
D. Steel, University of Wollongong
L. Stokes, Southern Methodist University
M. Thompson, University of Waterloo
V.J. Verma, Università degli Studi di Siena
K.M. Wolter, Iowa State University
C. Wu, University of Waterloo
W. Yung, Statistics Canada
A. Zaslavsky, Harvard University


Assistant Editors C. Bocci, K. Bosa, P. Dick, G. Dubreuil, S. Godbout, C. Leon, S. Matthews, Z. Patak, S. Rubin-Bleuer and
Y. You, Statistics Canada


EDITORIAL POLICY 


Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such 
as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, 
survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, 
estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and 
evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the 
authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial 
Board or of Statistics Canada. 


Submission of Manuscripts 


Survey Methodology is published twice a year. Authors are invited to submit their articles in English or French in electronic form, 
preferably in Word to the Editor, (smj@statcan.gc.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, 
K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca).


Subscription Rates 


The price of printed versions of Survey Methodology (Catalogue No. 12-001-XPB) is CDN $58 per year. The price excludes 
Canadian sales taxes. Additional shipping charges apply for delivery outside Canada: United States, CDN $12 ($6 x 2 issues); Other 
Countries, CDN $20 ($10 x 2 issues). A reduced price is available to members of the American Statistical Association, the 
International Association of Survey Statisticians, the American Association for Public Opinion Research, the Statistical Society of 
Canada and l’Association des statisticiennes et statisticiens du Québec. Electronic versions are available on Statistics Canada’s web
site: www.statcan.gc.ca. 


Survey Methodology 
A Journal Published by Statistics Canada 
Volume 38, Number 1, June 2012 


Contents

Regular Papers

Sebastian Bredl, Peter Winker and Kerstin Kötschau
A statistical approach to detect interviewer falsification of survey data .......... 1

Steven Elliott
The application of graph theory to the development and testing of survey instrument .......... 11

G. Hussain Choudhry, J.N.K. Rao and Michael A. Hidiroglou
On sample allocation for efficient domain estimation .......... 23

Ted Chang
Calibration alternatives to poststratification for doubly classified data .......... 31

Paul Knottnerus and Arnout van Delden
On variances of changes estimated from rotating panels and dynamic strata .......... 43

Dan Liao and Richard Valliant
Variance inflation factors in the analysis of complex survey data .......... 53

Hung-Mo Lin, Hae-Young Kim, John M. Williamson and Virginia M. Lesser
Estimating agreement coefficients from sample survey data .......... 63

Jörg Drechsler and Jerome P. Reiter
Combining synthetic data with subsampling to create public use microdata files for large scale surveys .......... 73

Balgobin Nandram and Myron Katzoff
A hierarchical Bayesian nonresponse model for two-way categorical data from small areas
with uncertainty about ignorability .......... 81

Short Notes

Phillip S. Kott
Why one should incorporate the design weights when adjusting for unit nonresponse using
response homogeneity groups .......... 95

[entry illegible in scan] .......... 101


Statistics Canada, Catalogue No. 12-001-X 


The paper used in this publication meets the minimum requirements of American National Standard for Information
Sciences - Permanence of Paper for Printed Library Materials, ANSI Z39.48 - 1984.


Survey Methodology, June 2012 
Vol. 38, No. 1, pp. 1-10 
Statistics Canada, Catalogue No. 12-001-X 


A statistical approach to detect interviewer falsification of survey data 


Sebastian Bredl, Peter Winker and Kerstin Kötschau


Abstract 


Survey data are potentially affected by interviewer falsifications with data fabrication being the most blatant form. Even a 
small number of fabricated interviews might seriously impair the results of further empirical analysis. Besides reinterviews, 
some statistical approaches have been proposed for identifying this type of fraudulent behaviour. With the help of a small 
dataset, this paper demonstrates how cluster analysis, which is not commonly employed in this context, might be used to 
identify interviewers who falsify their work assignments. Several indicators are combined to classify ‘at risk’ interviewers 
based solely on the data collected. This multivariate classification seems superior to the application of a single indicator
such as Benford’s law.


Key Words: Data fabrication; Falsifier; Benford’s law; Cluster analysis. 


1. Introduction 


Whenever data collection is based on interviews, one has 
to be concerned about data quality. Data quality can be 
affected by false or imprecise answers of the respondent or 
by a poorly designed questionnaire, as well as by the inter- 
viewer when he or she deviates from the prescribed inter- 
viewing procedure. If the interviewer does so consciously, 
this is referred to as ‘interviewer falsification’ (Schreiner, 
Pennie and Newbrough 1988) or ‘cheating’ (Schräpler and
Wagner 2003).

Interviewer falsification can occur in many ways (cf. 
Guterbock 2008). Rather subtle forms consist of surveying a 
wrong household member or of conducting the survey by 
telephone when face-to-face interviews are required. The 
most severe form of falsifying is the fabrication of entire 
interviews without ever contacting the respective household. 
In our analysis, we deal with the latter case. 

Fabricated interviews can have serious consequences for
statistics based on the survey data. Schnell (1991) and
Schräpler and Wagner (2003) provide evidence that the
effect on univariate statistics might be less severe, provided
the share of falsifiers remains sufficiently small and the
‘quality’ of the fabricated data is high. But even a small
proportion of fabricated interviews can be sufficient to
cause heavy biases in multivariate statistics. Schräpler
and Wagner (2003) find that the inclusion of fabricated
data from the German Socio Economic Panel (GSOEP)
in a multivariate regression reduces the effect of training
on log gross wages by approximately 80 percent,
although the share of fabricated interviews was less than
2.5 percent. This indicates the importance of identifying
these interviews.


The most common way to identify falsifying interviewers
is the reinterview (Biemer and Stokes 1989). In this
case, a supervisor contacts some of the households that
should have been surveyed to check whether they were
actually visited by the interviewer. However, for reasons of
expense, it is impossible to reinterview all households
participating in a survey (cf. Forsman and Schreiner 1991).
Therefore, the question arises of how the reinterview sample
can be optimized to best detect falsifiers. Generally, it seems
useful to select households for reinterview if the interviews
were done by an interviewer who, judged by characteristics
linked to the answers in his or her interviews, is more likely
than others to be fabricating data. In this context, Hood and
Bushery (1997) use the term ‘at risk’ interviewer. If reinterview
participants are sampled in a two-stage setting, whereby
interviewers are selected in the first stage and participants
surveyed by those interviewers in the second stage (as
recommended by Forsman and Schreiner (1991)), one might
oversample the at risk interviewers in the first stage.

In this paper, we demonstrate a purely statistical approach
that relies on the data contained in the questionnaires
to define a group of at risk interviewers. This is not a new
idea; the literature provides several examples of this kind of
approach (Hood and Bushery 1997; Diekmann 2002;
Turner, Gribbe, Al-Tayyip and Chromy 2002; Schräpler and
Wagner 2003; Swanson, Cho and Eltinge 2003; Murphy,
Baxter, Eyerman, Cunningham and Kennet 2004; Porras
and English 2004; Schäfer, Schräpler, Müller and Wagner
2005; Li, Brick, Tran and Singer 2009). However, with the
exception of the work of Li et al. (2009), the tests conducted
in these studies rely on the examination of single indicators
derived from the interviewer’s data to detect falsifiers. Some
studies calculate several indicators but consider them all
separately. We combine multiple indicators in cluster
analyses, allowing for a better classification of the potential
falsifiers compared to previous approaches. To the best of
our knowledge, this procedure is an innovation in the 
context of identifying interviewers who fabricate data, but 
has already been employed in other fields in order to detect 
fraudulent behaviour. The basic idea is that characteristics of 
fraudulent ‘cases’ (what a case is depends on the context) 
feature striking patterns compared to honest cases that can 
be revealed if those characteristics are jointly considered in 
a cluster analysis. Murad and Pinkas (1999) try to detect 
fraud in the telecommunication industry by means of 
clustering call profiles of clients. A call is characterized by 
several indicators like calling time or destination of the call. 
Thiprungsri (2010) clusters group life claims submitted 
from clients to life insurance companies based on several 
characteristics of the claims. Claims that form very small 
clusters are considered to be suspicious. Donoho (2004) 
uses cluster analysis, among others, to trace patterns in 
option markets that might indicate insider trading. 

We have a small survey dataset available (see subsection 
3.1 for a further description of our dataset), which partially 
consists of falsified data. With a total of 13 interviewers and 
250 questionnaires, the size of the dataset is quite limited 
and it is not clear to what extent our findings can be 
generalized to larger datasets. However, the dataset enables
us to demonstrate our approach. The fact that we know
which data were collected honestly and which were
fabricated allows for a first evaluation of our approach. It
must be stated that this a priori knowledge is not a prerequisite
for employing the method.

The problem of identifying at risk interviewers was
addressed in the 1980s; however, the literature on this issue is
still scarce. In 1982, the U.S. Census Bureau implemented
the Interviewer Falsification Study. Based on the informa- 
tion collected in the context of this study, Schreiner et al. 
(1988) find that interviewers with a shorter length of service 
are more likely to fabricate data. Hood and Bushery (1997) 
use several indicators to find at risk interviewers in the 
National Health Interview Survey (NHIS). For example, 
they calculate the rate of households that have been labelled 
ineligible or the rate of households without telephone 
number per interviewer and compare the rates to census data 
from the respective area. When large differences occur, the 
interviewer is flagged and a reinterview is conducted. De- 
tection rates among the flagged interviewers turn out to be 
higher than those in random reinterview samples. Turner 
et al. (2002) also find that interviewers who fabricate data
report telephone numbers less frequently than
honest interviewers when examining the Baltimore STD
and Behaviour Survey. For the case of computer assisted
interviewing, Bushery, Reichert, Albright and Rossiter
(1999) and Murphy et al. (2004) propose the use of date and
time stamps (the recording of the time and the duration of
the interview by the computer) to find suspect interviewers.
Those who need a remarkably long or short time to complete
the entire questionnaire or certain modules, or who complete
remarkably many questionnaires within a given time period,
might be flagged as at risk interviewers. Schäfer et al.
(2005) assume that falsifiers avoid extreme answers when 
fabricating data. Using data of the GSOEP, the authors 
calculate the variance of the answers for every question on 
all questionnaires of an interviewer and sum up all vari- 
ances. Thanks to other control mechanisms in the GSOEP, 
falsifiers are known and it turns out that they could be found 
among the interviewers with the lowest overall variances. 
Porras and English (2004) use a similar approach and also
find falsifiers to produce variances that are smaller than those
found in honestly filled questionnaires. Li et al. (2009)
combine several predictive indicators in a logistic regression 
model in which the known falsification status of an inter- 
view serves as a binary dependent variable. The authors find 
that reinterview samples that overweight cases with a high 
probability of being fraudulent according to the logistic 
regression model identify more cases of actual data fabrica- 
tion than purely randomly drawn samples. However, it is 
evident that past reinterview data with known falsification 
status must be available to conduct the logistic regression. 

Further indicators discussed in the literature are the number
of rare or unlikely response combinations in an interviewer’s
questionnaires (Murphy et al. 2004; Porras and
English 2004) and the comparison of household compositions
or descriptive statistics in an interviewer’s questionnaires
with the entire sample (Turner et al. 2002; Murphy et al.
2004).

Another means of detecting fabricated data that has 
gained a lot of popularity in recent years is Benford’s law 
(Schräpler and Wagner 2003; Swanson et al. 2003; Porras
and English 2004; Schäfer et al. 2005), which will be
discussed in section 2, along with its success in detecting 
fabricated interviews in previous studies. Furthermore, 
section 2 describes our statistical approach to identify 
falsifiers. Section 3 presents the data our analysis is based 
upon as well as our results. The paper concludes with a 
discussion of our findings. 


2. Methods 


2.1 Benford’s law 


When the physicist Frank Benford noticed that the pages
in logarithmic tables containing the logarithms of low
numbers (1 and 2) were more heavily used than pages containing
logarithms of higher numbers (8 and 9), he started to
investigate the distribution of leading digits in a wide range
of different types of numbers, such as numbers on the first page
of a newspaper, street addresses or molecular weights
(Benford 1938). Benford found that the distribution of the
leading non-zero digits could be described by the following
formula, which has become known as ‘Benford’s law’:


$$\mathrm{Prob}(\text{leading digit} = d) = \log_{10}\left(1 + \frac{1}{d}\right) \qquad (1)$$
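As a small illustration of equation (1), the following sketch tabulates the probabilities for the digits 1 to 9. The names `benford_probability` and `BENFORD` are chosen here for illustration only and do not come from the paper.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that the leading non-zero digit equals d, following equation (1)."""
    if not 1 <= d <= 9:
        raise ValueError("a leading non-zero digit lies between 1 and 9")
    return math.log10(1 + 1 / d)


# Benford's distribution over all nine possible leading digits.
BENFORD = {d: benford_probability(d) for d in range(1, 10)}

for digit, prob in BENFORD.items():
    print(digit, round(prob, 3))
```

The nine probabilities sum to one, and low digits are markedly more likely than high ones, which is the pattern Benford observed in the logarithmic tables.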


However, not all series of numbers Benford (1938) 
investigated seemed to conform to his law. Consequently, 
the question arose what kind of data can be supposed to 
produce first digits in line with the law. Discussions of this 
issue are provided by Hill (1995), Nigrini (1996), Hill 
(1999) and Scott and Fasli (2001). The detection of financial 
fraud is a field in which the application of Benford’s law has 
gained much popularity during the recent decade (Nigrini 
1996; 1999; Saville 2006). The results of those studies are 
not relevant in our context. However, it is interesting to note 
that there seems to be a consensus in literature that monetary 
values can be supposed to follow Benford’s law. Swanson 
et al. (2003) show that the distribution of first digits in the 
American Consumer Expenditure Survey is close to 
Benford’s distribution. 

The basic idea of using Benford’s law to detect fabricated 
data is that falsifiers are unlikely to know the law or to be 
able to fabricate data in line with it. Therefore a strong 
deviation of the leading digits from Benford’s distribution in 
a dataset indicates that the data might be faked. Of course,
one first has to ask whether the data are of a kind that
can be expected to follow Benford’s law when they are authentic.
Benford’s law cannot be applied if the questionnaires do not 
contain any or contain only very few metric variables. 

Schräpler and Wagner (2003) and Schäfer et al. (2005)
use Benford’s law to detect data fabrication in the GSOEP.
In both studies, all questionnaires delivered by every single
interviewer are combined and checked for whether the
distribution of the first digits in the respective questionnaires
deviates significantly from Benford’s law. This can be done
by calculating the $\chi^2$-statistic:


$$\chi_i^2 = n_i \sum_{d=1}^{9} \frac{(h_{id} - h_{bd})^2}{h_{bd}} \qquad (2)$$


where $n_i$ is the number of leading digits in all questionnaires
from interviewer $i$, $h_{id}$ is the observed proportion of
leading digit $d$ in all leading digits in interviewer $i$’s
questionnaires and $h_{bd}$ is the proportion of leading digit $d$
in all leading digits under Benford’s distribution. High $\chi^2$-values
indicate a deviation from Benford’s distribution and
thus point to at risk interviewers. Schräpler and Wagner (2003)
use different kinds of continuous variables in their analysis,
whereas Schäfer et al. (2005) restrict theirs to monetary
values. In both studies, the critical $\chi^2$-values are assumed to
be dependent on the sample size $n$ and are consequently
adjusted for this parameter. The results obtained look
promising. The fit of the distribution of first digits to
Benford’s distribution in the questionnaires of falsifiers
(who were already known in advance) is, in general, much
worse than for honest interviewers. Thus it seems appropriate
to use Benford’s law as a means to identify at risk
interviewers.
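The interviewer-level statistic in equation (2) can be sketched as follows. The digit lists are invented for illustration, the function names are hypothetical, and the sample-size adjustment of the critical values used in the cited studies is not reproduced here.

```python
import math
from collections import Counter


def benford_expected():
    """Expected proportions h_bd of leading digits under Benford's distribution."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def chi_square(digits, expected):
    """n_i * sum_d (h_id - h_bd)^2 / h_bd for the pooled leading digits of one interviewer."""
    n = len(digits)
    if n == 0:
        raise ValueError("no leading digits supplied")
    counts = Counter(digits)
    return n * sum(
        (counts.get(d, 0) / n - expected[d]) ** 2 / expected[d]
        for d in range(1, 10)
    )


# Hypothetical data: one digit list that is close to Benford's distribution
# and one in which all digits are equally frequent.
benford_like = [1] * 30 + [2] * 18 + [3] * 12 + [4] * 10 + [5] * 8 + [6] * 7 + [7] * 6 + [8] * 5 + [9] * 4
uniform_like = [d for d in range(1, 10) for _ in range(11)]

chi_benford_like = chi_square(benford_like, benford_expected())
chi_uniform_like = chi_square(uniform_like, benford_expected())
print(chi_benford_like, chi_uniform_like)
```

In this toy example the equally distributed digits yield a much larger value than the Benford-like digits, which is the kind of contrast the statistic is meant to expose.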

However, when we compared the data of the honest
interviewers in our dataset to Benford’s distribution, we
observed a large deviation for the digit 5. This might be due
to rounding of numbers by the respondents. The same
problem is mentioned by Swanson et al. (2003) and Porras
and English (2004), who opt for applying an alternative
approach “in the spirit of Benford” (Porras and English
2004, page 4224). We adopt this approach, which consists of
comparing the distribution of leading digits in the questionnaires
of an interviewer to the distribution of first digits in
all questionnaires except their own. The $\chi^2$-value on the
interviewer level is calculated as described above, but the
expected proportion of a digit according to Benford’s law,
$h_{bd}$, is replaced by the proportion of the digit in all other
questionnaires. We then use the resulting $\chi^2$-value as one
indicator in the cluster analysis.
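A minimal sketch of this leave-one-out comparison is given below. The interviewer labels and digit lists are invented, and skipping digits that never occur among the other interviewers (to avoid a zero denominator) is an assumption made here, not a rule stated in the paper.

```python
from collections import Counter


def digit_proportions(digits):
    """Proportion of each leading digit 1..9 in a list of digits."""
    n = len(digits)
    counts = Counter(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


def leave_one_out_chi_square(digits_by_interviewer):
    """Chi-square value of each interviewer against the pooled digits of all others."""
    result = {}
    for name, own in digits_by_interviewer.items():
        others = [
            d
            for other, other_digits in digits_by_interviewer.items()
            if other != name
            for d in other_digits
        ]
        reference = digit_proportions(others)
        observed = digit_proportions(own)
        # Assumption: digits absent from the reference distribution are skipped.
        result[name] = len(own) * sum(
            (observed[d] - reference[d]) ** 2 / reference[d]
            for d in range(1, 10)
            if reference[d] > 0
        )
    return result


# Hypothetical pooled leading digits; A and B share a similar pattern
# (including a spike at 5), C looks very different.
example = {
    "A": [1] * 12 + [2] * 8 + [3] * 5 + [5] * 10 + [7] * 3 + [9] * 2,
    "B": [1] * 11 + [2] * 9 + [3] * 6 + [5] * 9 + [6] * 2 + [8] * 3,
    "C": [4] * 8 + [6] * 9 + [7] * 10 + [8] * 7 + [9] * 6,
}
scores = leave_one_out_chi_square(example)
print(scores)
```

Because the reference distribution is taken from the data themselves, a common rounding pattern such as the excess of the digit 5 does not by itself inflate the value of an honest interviewer.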

With regard to the selection of variables whose first
digits are examined, we follow the approach of Schäfer
et al. (2005) and include only the first digits of monetary
values in the analysis. The survey we are using for demonstration
purposes contains monetary values expressed in
local currency referring to household expenditures for different
items like leasing or buying land, seeds, fertilizer and
taxes, and to household income from different sources like
agricultural or non-agricultural self-employment and public
or private transfers. Overall we include first digits of 26
different monetary values per interview, ignoring values that
were reported to be zero. We then pool first digits of all
questionnaires delivered by one interviewer and compare
the distribution of first digits to the one for all other
interviews according to the method described above. The
restriction to monetary values constitutes a clear criterion
during the process of selecting data. Furthermore, as mentioned
above, financial data are broadly agreed to be suitable
for analysis with Benford’s law. This is important, although
we do not ground our analysis on Benford’s distribution
itself but on an approach based on it.
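The pooling step can be sketched as follows, assuming that each interview is available as a list of monetary amounts and that missing values are coded as `None`. Both the data layout and the function names are assumptions made for illustration.

```python
def leading_digit(value):
    """First non-zero digit of a positive amount; None for zero or missing values."""
    if value is None or value <= 0:
        return None
    # Strip leading zeros and the decimal point, e.g. '0.045' -> '45'.
    return int(str(value).lstrip("0.")[0])


def pooled_leading_digits(interviews):
    """Pool the leading digits over all interviews delivered by one interviewer."""
    digits = []
    for amounts in interviews:
        for amount in amounts:
            digit = leading_digit(amount)
            if digit is not None:
                digits.append(digit)
    return digits


# Hypothetical interviews of one interviewer; zeros and missing values are ignored.
interviews = [[1250, 0, 300.5], [0.07, None, 48]]
print(pooled_leading_digits(interviews))
```

The resulting list of digits is the input that a comparison against the digits of all other interviewers would require.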


2.2 Multivariate analyses 


Our idea is to combine several indicators, which we
derive directly from the questionnaires of each interviewer
and which we suppose to be different for falsifiers and
honest interviewers. We do this by means of cluster and
discriminant analysis. All indicators are derived on the
interviewer level. This implies that we pool all questionnaires
of one interviewer for the analysis, which increases
the amount of data on which every single indicator value is
based. This should make the indicator values more reliable
and less sensitive to outliers. On the other hand, it is obvious
that the discriminatory power of interviewer-level indicators
decreases as soon as interviewers only fake parts of their
assignments. Looking at indicators on the questionnaire
level, therefore, seems to be preferable if the amount of data
per questionnaire is sufficiently high.

The cluster analysis is the actual method of identifying
at risk interviewers. The interviewers are clustered in
two groups with the intention of obtaining one that contains 
a high share of falsifiers and another one that contains a high 
share of honest interviewers. Clustering does not require a 
priori information on who is fabricating data and who is not. 
In fact, this is what it is supposed to reveal. Since we know 
from the outset which interviewer belongs to which group, 
we can discover whether the cluster analysis identifies the 
‘true falsifiers’ to be at risk. Clearly, the assumption that our 
approach is able to separate both groups perfectly is not 
realistic. The idea is rather that we obtain an at risk inter- 
viewer cluster exhibiting a higher share of falsifiers com- 
pared to the other cluster. If a reinterview is feasible, sub- 
sequent reinterview efforts might be focused on interviewers 
in the at risk cluster. 
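To make the two-group idea concrete, the sketch below clusters five hypothetical interviewers on four indicators. Several things are assumptions introduced here and are not taken from the paper: the indicator values are invented, plain 2-means on standardized indicators stands in for whatever clustering algorithm is actually used, and the cluster with the higher mean $\chi^2$-value is simply labelled as the at risk cluster.

```python
import math


def standardize(rows):
    """Scale each indicator (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def two_means(points, iterations=100):
    """Plain 2-means clustering; returns one label (0 or 1) per point."""
    # Deterministic start: the two points that lie farthest apart.
    pairs = [
        (math.dist(p, q), i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i < j
    ]
    _, i, j = max(pairs)
    centres = [list(points[i]), list(points[j])]
    labels = None
    for _ in range(iterations):
        new_labels = [
            0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centres[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Hypothetical indicators per interviewer:
# (chi-square value, item-non-response ratio, extreme-answers ratio, others-ratio)
indicators = {
    "I1": (4.1, 0.08, 0.35, 0.12),
    "I2": (5.3, 0.07, 0.40, 0.10),
    "I3": (3.8, 0.09, 0.38, 0.11),
    "I4": (21.0, 0.01, 0.15, 0.02),
    "I5": (18.5, 0.02, 0.12, 0.03),
}
labels = two_means(standardize(list(indicators.values())))

clusters = {0: [], 1: []}
for name, label in zip(indicators, labels):
    clusters[label].append(name)

# Assumption: the cluster with the higher mean chi-square value is treated as at risk.
at_risk = max(
    clusters.values(),
    key=lambda names: sum(indicators[n][0] for n in names) / len(names),
)
print(sorted(at_risk))
```

No falsification status enters the clustering itself; the labels are derived from the indicators alone, which mirrors the point that a priori information is not required.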

To judge the performance of the cluster analysis, we
consider the number of undetected falsifiers as well as the
number of ‘false alarms.’ Both types of ‘errors’ incur costs:
data of undetected falsifiers are likely to impair the results of
further statistical analysis. False alarms incur costs in the
sense that an unnecessary effort to reinterview the respective
households might be made or data are unnecessarily removed
from the sample. Furthermore, it might be demoralizing for
honest interviewers if they see their work being subject to a
reinterview, particularly if they are aware of the fact that
predominantly the work of at risk interviewers is picked.
How to weight an undetected falsifier compared to a false
alarm in a loss function is a highly subjective issue. Generally,
it seems reasonable to assign more weight to the former
than to the latter.
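One simple way to write down such a loss function is sketched below. The weights of 2 for an undetected falsifier and 1 for a false alarm are arbitrary placeholders chosen only to reflect the ordering suggested above, and the function name and interviewer labels are hypothetical.

```python
def classification_loss(at_risk, falsifiers, miss_weight=2.0, false_alarm_weight=1.0):
    """Return (undetected falsifiers, false alarms, weighted loss).

    at_risk    -- interviewers placed in the at risk cluster
    falsifiers -- interviewers known to have fabricated data
    """
    at_risk, falsifiers = set(at_risk), set(falsifiers)
    undetected = falsifiers - at_risk      # falsifiers that were not flagged
    false_alarms = at_risk - falsifiers    # honest interviewers that were flagged
    loss = miss_weight * len(undetected) + false_alarm_weight * len(false_alarms)
    return len(undetected), len(false_alarms), loss


# Hypothetical outcome: I6 is missed and I2 is a false alarm.
print(classification_loss({"I2", "I4", "I5"}, {"I4", "I5", "I6"}))
```

Changing the two weights makes the subjective trade-off explicit and allows different clustering results to be compared under the same assumptions.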

The discriminant analysis requires knowledge of the
falsifier versus non-falsifier status of each interviewer
before it can be conducted. Therefore, it is not an instrument
to detect falsifiers. We use the discriminant analysis to 
verify our hypotheses on the behaviour of falsifiers, which 
will be discussed below, and to evaluate how well the 
employed indicators can separate the two groups. 

One of the indicators we use is the 7 -value, calculated 
by comparing the distribution of first digits in the ques- 
tionnaires of each interviewer with the respective dis- 
tribution in all other questionnaires as described in the 
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previous subsection. Furthermore, we derive three other 
indicators from hypotheses concerning the behaviour of 
falsifiers fabricating data. Schäfer et al. (2005) assume that 
falsifiers have a tendency to answer every question, thus 
producing fewer missing values. Furthermore, in line with 
Porras and English (2004), they expect falsifiers to choose 
extreme answers to ordinal questions less often. Hood and 
Bushery (1997) hypothesize that falsifiers will “try to keep it 
simple and fabricate a minimum of falsified data” (Hood 
and Bushery 1997, page 820). 
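
The first-digit comparison behind the χ²-value can be sketched as follows. The exact formula is given in the previous subsection and is not reproduced here, so this sketch assumes a Pearson-type statistic in which the expected counts come from the digit shares in all other questionnaires. The function names and the example numbers are hypothetical.

```python
from collections import Counter

def first_digit(value):
    """Leading non-zero digit of a number (None for zero)."""
    digit = int(f"{abs(value):.15e}"[0])
    return digit or None

def chi_square_first_digits(own_values, other_values):
    """Observed first-digit counts of one interviewer against the counts
    expected from the digit shares in all other questionnaires."""
    own = Counter(first_digit(v) for v in own_values if v)
    others = Counter(first_digit(v) for v in other_values if v)
    n_own, n_others = sum(own.values()), sum(others.values())
    statistic = 0.0
    for digit in range(1, 10):
        expected = n_own * others[digit] / n_others
        if expected > 0:  # digits never seen elsewhere are skipped in this sketch
            statistic += (own[digit] - expected) ** 2 / expected
    return statistic

# Hypothetical metric answers (e.g., income amounts), not data from the study.
same_pattern = chi_square_first_digits([1, 2, 3], [10, 20, 30, 100, 200, 300])
odd_pattern = chi_square_first_digits([1, 1, 1], [1, 2, 3])
print(same_pattern, odd_pattern)
```

An interviewer whose first digits follow the same shares as everyone else obtains a value of zero, while a concentration on a single digit raises the statistic.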

Based on these assumptions, we calculate three proportions, 
which serve as indicator variables in the multivariate 
analyses along with the χ²-value. The three indicator variables 
are calculated as follows: 


The ‘item-non-response-ratio’ is the proportion of ques- 
tions which remain unanswered in all questions. We 
expect this ratio to be lower for falsifiers than for honest 
interviewers. 

The ‘extreme-answers-ratio’ refers to answers which 
are measured in ordinal scales. The ratio indicates the 
share of extreme answers (the lowest or highest 
category on the scale) in all ordinal answers. According 
to the above-mentioned assumptions, this ratio should 
also be lower for falsifiers. 

The ‘others-ratio’ refers to questions which, besides 
several framed responses, offer the item ‘others’ as a 
possible answer. The choice of this item requires the 
explicit declaration of an alternative. If falsifiers tend to 
keep it simple, we can expect them to prefer the framed 
responses to the declaration of an alternative. Thus, this 
ratio too (calculated as the proportion of ‘others’ 
answers in all answers where the others item is 
selectable) should be lower for falsifiers. 
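
The three ratios defined above can be computed along the following lines. The record layout is an assumption made for this sketch: each answer is a dictionary with the hypothetical keys `value` (None if unanswered), `scale` (lowest and highest category for ordinal questions, otherwise None), `offers_others` and `chose_others`.

```python
def share(part, whole):
    """Proportion that falls back to 0.0 when the base is empty."""
    return part / whole if whole else 0.0

def indicator_ratios(answers):
    """Item-non-response, extreme-answers and others ratios for one interviewer."""
    answered = [a for a in answers if a["value"] is not None]
    ordinal = [a for a in answered if a["scale"] is not None]
    with_others = [a for a in answered if a["offers_others"]]
    return {
        "item_nonresponse": share(len(answers) - len(answered), len(answers)),
        "extreme": share(sum(a["value"] in a["scale"] for a in ordinal),
                         len(ordinal)),
        "others": share(sum(a["chose_others"] for a in with_others),
                        len(with_others)),
    }

def answer(value, scale=None, offers_others=False, chose_others=False):
    """Helper that builds one hypothetical answer record."""
    return {"value": value, "scale": scale,
            "offers_others": offers_others, "chose_others": chose_others}

# Five made-up answers: two ordinal, one missing, two with an 'others' item.
sample = [
    answer(3, scale=(1, 5)),
    answer(5, scale=(1, 5)),
    answer(None),
    answer("goats", offers_others=True, chose_others=True),
    answer("wheat", offers_others=True),
]
ratios = indicator_ratios(sample)
print(ratios)
```

In this sketch unanswered questions are excluded from the bases of the extreme and others ratios; whether the study handled them the same way is not stated in the text.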


Of course, the list of indicator variables, which might be 
included in the cluster analysis, can be extended. Generally, 
it is possible to derive many more of those variables from 
hypotheses on the behaviour of interviewers who fabricate 
data or to use those which have already been proposed in the 
literature, albeit not in the context of cluster analysis. For 
example, based on the assumption that falsifiers try to 
fabricate a minimum of falsified data, Hood and Bushery 
(1997) expect them to disproportionately often select the 
answer ‘No’ to questions, which either lead to a set of new 
questions or avoid it (assuming that ‘No’ is generally the 
answer that avoids further questions). So one could calculate 
the ratio of ‘No’ answers to such questions and use this ratio 
as a variable in the cluster analysis. We do not use this ratio, 
as two slightly different versions of the questionnaire were 
used in our empirical sample. There are only a small 
number of questions that lead to new questions or avoid 
them depending on the answers, which are identical in both 
versions of the questionnaire. 

Furthermore, when computer assisted interviewing al- 
lows the use of date and time stamps as discussed by 
Bushery et al. (1999), the average time needed to conduct 
an interview or the number of interviews conducted in one 
day might serve as indicators. Panel surveys offer some 
additional information to construct indicators. Stokes and 
Jones (1989) propose to compare the actual rate of non- 
matched household members in an interviewer’s question- 
naires to expected nonmatch rates that are calculated condi- 
tional on several household characteristics. The authors 
employ this procedure in the post-enumeration survey that is 
conducted as follow-up survey for the U.S. Census. If the 
actual rate of nonmatches strongly exceeds the expected 
rate, the authors consider this to be an indicator for fabri- 
cated data. Generally, this approach is applicable as soon as 
one has two or more waves of a panel survey available. 

It becomes obvious that the first steps of our approach 
consist of examining the structure of the questionnaire and 
other types of data like date or time stamps collected during 
the survey process. Then one might consider which indica- 
tors could be derived from those sources that are likely to 
differ between falsifiers and honest interviewers. Another 
approach is the use of data mining techniques to identify 
patterns that are common in fabricated data or patterns in 
which fabricated data differs from honestly collected data 
(Murphy, Eyerman, McCue, Hottinger and Kennet 2005). If 
those patterns are detected, they might be used as indicators 
instead of deriving indicators from hypotheses on falsifier 
behaviour. However, this approach requires a huge dataset 
with known cases of falsification in order to conduct the 
data mining process. Such a dataset is not always available. 


3. Results 


3.1 Data sources 


The data used in this study are derived from household 
surveys conducted in November 2007 and February 2008 in 
a Commonwealth of Independent States (CIS) (i.e., former 
Soviet Union) country. The survey was part of an inter- 
national research project on land reforms and rural poverty. 
We intended to interview 200 households in four villages in 
2007. After identifying that all interviews had been 
fabricated in the first surveyed village we broke the survey 
off and started a new round with new interviewers in other 
villages in February 2008. All villages had been selected by 
qualitative criteria like the agricultural production structure 
and the implementation of land reforms. The households 
within one village had been selected by random sample 
based on household lists, which were provided by the 
mayors of the villages. This procedure not only assured that 
all households had been selected at random, but also 
provided the basis for reinterviews as all households were 
exactly defined. However, these reinterviews were not 
planned in the very beginning. Because the households 
rarely owned telephones, check-calls were not possible and 
reinterviews in these households were associated with high 
costs and expenditure of time for traveling to the village for 
a face-to-face reinterview. Five interviewers were engaged 
in the first 2007 survey. Two of them had been the local 
partners of the research project. They had been involved in 
the development of the questionnaire and were responsible 
for the coordination of the surveys in their country. The 
other three interviewers were students hired by the partners. 
The questionnaire was composed of different sections with 
regard to household characteristics, resource endowment as 
well as income and expenditures. Most of the questions 
were closed questions. Only a few questions included a 
scale. Metric variables were collected for household expen- 
ditures like leasing or buying land, seeds, fertilizer or taxes 
and household income from different sources like agricul- 
tural or non-agricultural self employment and public or 
private transfers. 

When the interviews of the 2007 survey were conducted, 
none of the German researchers were present in the villages. 
The questionnaires were collected right after the survey of 
the first village. In a first review of the questionnaires, we 
became suspicious because the paper of the questionnaires 
looked very clean and white. There was no dirt or dog-ears 
on the paper. Comparing the answers of different question- 
naires of one interviewer we found two questionnaires with 
identical answers. Considering the fact that we asked for the 
amount of income from different sources in metric numbers 
it was very unlikely that the answers of two questionnaires 
would have been identical. Not getting any explanations 
from the project partners, we reinterviewed a sub-sample of 
10% of the original sample face-to-face. None of the 
reinterviewed households reported having been surveyed. 
After detecting the fabrication of the interviews, the partners 
acknowledged that all interviews had been fabricated. As a 
matter of course, we stopped working with all interviewers 
and partners and implemented a new local research group. 

In February 2008, the survey was repeated in the same 
country. As mentioned before, we selected new villages and 
households according to the above-mentioned criteria. We 
hired nine students for the interviews and arranged the 
survey with on-site supervision. In most cases, the 
interviews took place in a school or the city hall so that we 
could monitor all interviewers. When the interviews took 
place in the houses of the surveyed families we attended 
some of them. Due to this procedure, we presume that the 
questionnaires from the 2008 survey are not fabricated. 

In this paper, we use a total of 250 household interviews 
by 13 interviewers. Four of them are falsifiers from the 
2007 survey who definitely faked the results, referred to as 
F1-F4 (the interviews submitted by a fifth falsifier were 
excluded as he filled in only three questionnaires). The 
other nine interviewers, labelled H1-H9, are from the 2008 
survey and are supposed to be honest. Table 1 provides an 
overview of the number of questionnaires per interviewer 
included in the analysis. 


Table 1 
Number of questionnaires per interviewer 


Interviewer               F1  F2  F3  F4  H1  H2  H3  H4  H5  H6  H7  H8  H9 
Number of questionnaires  [values illegible in source] 


3.2 Cluster analysis 


In this subsection, we present the results of the cluster 
analysis. Based on the results, we evaluate the success of 
our procedure in identifying interviewers who fabricate 
data. As already mentioned, we use four indicator variables 
in the cluster analysis: the item-non-response ratio, the 
proportion of extreme ordinally scaled answers in all ordi- 
nally scaled answers referred to as extreme ratio, the 
proportion of answers where the others item including an 
alternative was selected in all answers which offered this 
item (referred to as others ratio) and the χ²-value stemming 
from the comparison of the leading digit distribution in the 
questionnaires of an interviewer with the respective 
distribution in all other questionnaires. 

Table 2 provides the values of the four indicator variables 
included in the cluster analysis for all 13 interviewers. It 
shows that the item-non-response ratio and the others ratio 
are clearly lower for the four falsifiers than for the honest 
interviewers. F1 and F4 have not chosen the others item at 
all. For the extreme ratio, things seem to be less clear. All 
the values range between 40% and 70% except the value of 
interviewer F1, which is clearly lower. The χ²-values are 
quite high for falsifiers F2 and F4. The values of the other 
two falsifiers do not differ much from the ones observed for 
honest interviewers. 

The general idea of cluster analysis is to identify 
subgroups of elements in a space of elements that are all 
characterized by multivariate measurements (see Härdle and 
Simar (2007) for an introduction to cluster analysis). In the 
first step, a measure to determine either distance or 
similarity between elements has to be chosen. In the second 
step, elements are assigned to different subgroups or 
clusters. Elements within one cluster should be similar 
according to the selected measure whereas elements in 
different clusters should be distant. There is a large variety 
of methods according to which elements can be assigned to 
clusters whereby the number of clusters might either be 
fixed or determined by the cluster method. 


Table 2 
Values of the variables included in the cluster analysis for each 
interviewer (all values except χ²-value in percent) 


Interviewer  Item-Non-Response  Others  Extreme  χ²-value 
F1           1.36               0.00    28.33    19.63 
F2           0.71               0.65    40.85    29.70 
F3           0.68               2.63    56.90    11.34 
F4           0.51               0.00    58.62    21.33 
H1           3.85               18.01   65.12    14.48 
H2           1.99               2.40    59.42    6.91 
H3           3.10               9.47    70.07    15.49 
H4           4.52               13.04   56.43    16.61 
H5           1.18               4.48    70.07    12.16 
H6           3.46               1.37    50.75    15.42 
H7           2.51               12.72   45.65    [illegible] 
H8           [illegible]        10.95   69.85    3.63 
H9           0.14               1.61    69.44    19.14 


We measured distance as squared Euclidean distance and 
employed several cluster procedures in order to check the 
robustness of the results. In all cases, the interviewers have 
been clustered in two groups with the intention to obtain one 
‘falsifier group’ and one ‘honest interviewer group.’ The 
advantage of this approach is that a clear classification is 
obtained. In contrast, when one of the indicator variables is 
examined separately, it is not clear where to draw the line 
separating falsifiers and honest interviewers. Before conducting 
the cluster analysis, we standardized all variables to 
a mean of zero and a variance of one. This eliminates 
the scale effect as distances are measured in standard 
deviations and not in different units. 
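
The standardization and the distance measure can be sketched as follows. The text does not say whether the population or the sample variance was set to one, so the population version (`pstdev`) is an assumption here, and the rows are made-up values rather than the indicators in Table 2.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Rescale every column to mean zero and (population) variance one."""
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, centers, spreads)]
            for row in rows]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two elements."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical rows (one per interviewer) with two differently scaled indicators.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
z = standardize(raw)
d_raw = squared_euclidean(raw[0], raw[1])   # dominated by the second column
d_z = squared_euclidean(z[0], z[1])         # both columns on a common scale
print(d_raw, round(d_z, 3))
```

Before standardization the second indicator dominates the distance simply because it is measured in larger units; afterwards both contribute on the same scale.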

The first clustering method we use is hierarchical 
clustering. This is a standard procedure that can also be 
applied to larger datasets and is implemented in standard 
statistical software packages. Hierarchical clustering merges 
clusters step by step, combining the two closest clusters. At 
the beginning, every element is considered as a separate 
cluster. We measure distance between two clusters as the 
average squared Euclidean distance between all possible 
pairs of elements with the first element of the pair coming 
from one cluster and the second element from the other 
cluster. We used the software package STATA with the 
option ‘average linkage’ to conduct the hierarchical cluster 
analysis. 
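
The merging rule described above can be sketched in a few lines. This is a plain illustration of average linkage with squared Euclidean distance, not the STATA routine used in the study; the function names and the example points are hypothetical.

```python
from itertools import combinations

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_distance(points, c1, c2):
    """Average squared Euclidean distance over all cross-cluster pairs."""
    total = sum(squared_euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def average_linkage(points, n_clusters=2):
    """Start with singletons and merge the two closest clusters step by step."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(points, clusters[pair[0]],
                                              clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]   # a < b, so index a stays valid
        del clusters[b]
    return clusters

# Hypothetical two-dimensional points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = sorted(sorted(c) for c in average_linkage(points))
print(groups)
```

Once two elements are merged they are never separated again, which is why the result need not be globally optimal.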

In hierarchical cluster analysis, two elements will stay in 
the same cluster once they are merged together. Thus, the 
procedure does not necessarily lead to a global optimum 
with regard to a given distance measure. In our case the 
relatively low number of interviewers allows us to conduct 
an alternative analysis by simply examining all possible 
cluster compositions and selecting the best one with regard to a 
certain target function. (The analysis was carried out in 
MATLAB; the program code is available upon request.) 
This procedure is clearly superior to hierarchical clustering 
as it ensures that the globally optimal cluster composition is 
identified. However, we also provide the results of hierarchical 
clustering, as it remains feasible when the number of 
interviewers rises, whereas trying all possible compositions 
becomes computationally prohibitive. 
Alternatively, one might resort to heuristic optimization 
techniques. 
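
The exhaustive search can be sketched as an enumeration of every split of the interviewers into two non-empty clusters. The generator below is an assumption about how such an enumeration might be organized (the original MATLAB code is not shown in the text); fixing the first element in cluster 1 avoids counting each split twice.

```python
from itertools import combinations

def two_cluster_compositions(elements):
    """Yield every split of `elements` into two non-empty clusters once."""
    first, rest = elements[0], elements[1:]
    for size in range(len(rest)):              # cluster 2 must keep one element
        for extra in combinations(rest, size):
            cluster1 = (first,) + extra
            cluster2 = tuple(e for e in rest if e not in extra)
            yield cluster1, cluster2

interviewers = ["F1", "F2", "F3", "F4",
                "H1", "H2", "H3", "H4", "H5", "H6", "H7", "H8", "H9"]
n_compositions = sum(1 for _ in two_cluster_compositions(interviewers))
print(n_compositions)
```

For 13 interviewers this gives 2^12 - 1 = 4,095 compositions, which is trivial to evaluate; the count doubles with every additional interviewer, which is why the approach becomes infeasible for large surveys.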

When examining all possible cluster compositions we 
use two target functions. The first one combines the ideas 
that a large distance between the two cluster centers is 
desirable, as is a small distance between the elements of 
a cluster and the cluster center. We look for the cluster 
composition that maximizes the following expression: 

$$
\frac{\sum_{i=1}^{4} (\bar{d}_{i1} - \bar{d}_{i2})^2}
     {\sum_{j=1}^{n_1} \sum_{i=1}^{4} (d_{ij} - \bar{d}_{i1})^2
      + \sum_{j=n_1+1}^{13} \sum_{i=1}^{4} (d_{ij} - \bar{d}_{i2})^2}
\qquad (3)
$$


The index $i$ represents the four different indicator variables, 
$\bar{d}_{ia}$ with $a = 1, 2$ is the mean of variable $i$ in cluster $a$, $j$ 
symbolizes the different elements (interviewers) in cluster 1 
and cluster 2, $d_{ij}$ is the value of variable $i$ for element $j$, 
and $n_1$ is the number of elements in cluster 1. Thus the 
numerator measures the distance between the two clusters, 
the denominator the distance within clusters, and distance is 
measured in squared Euclidean form. 
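
The first target function, as described above, can be sketched as follows: the squared Euclidean distance between the two cluster centers divided by the summed squared distances of all elements to their own cluster center. Function names and the example clusters are hypothetical, and the sketch does not guard against a zero denominator.

```python
def cluster_means(cluster):
    """Mean of every indicator variable within one cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def between_within_ratio(cluster1, cluster2):
    """Distance between cluster centers divided by distance within clusters."""
    m1, m2 = cluster_means(cluster1), cluster_means(cluster2)
    between = sum((a - b) ** 2 for a, b in zip(m1, m2))
    within = sum(
        (x - m) ** 2
        for cluster, center in ((cluster1, m1), (cluster2, m2))
        for row in cluster
        for x, m in zip(row, center)
    )
    return between / within

# Hypothetical standardized indicator values for two small clusters.
ratio = between_within_ratio([[0, 0], [2, 0]], [[10, 0], [12, 0]])
print(ratio)
```

Here the centers are (1, 0) and (11, 0), so the numerator is 100; each element lies one unit from its center, so the denominator is 4.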

Alternatively, it could be interesting to see what optimal 
cluster composition results if, instead of maximizing Equation (3), 
the average squared Euclidean distance between all 
possible pairs of elements within one cluster is minimized. 
In fact, this idea is very similar to the relevant target func- 
tion in the hierarchical cluster procedures we presented 
before. Our second distance measure, which this time is to 
be minimized, is calculated as follows: 


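$$
\frac{\sum_{j=1}^{n_1-1} \sum_{k=j+1}^{n_1} SED_{jk}
      + \sum_{j=n_1+1}^{12} \sum_{k=j+1}^{13} SED_{jk}}
     {\frac{n_1 (n_1 - 1)}{2} + \frac{(13 - n_1)(12 - n_1)}{2}}
\qquad (4)
$$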


$SED_{jk}$ is the squared Euclidean distance between elements 
$j$ and $k$, calculated as $SED_{jk} = \sum_{i=1}^{4} (d_{ij} - d_{ik})^2$. The 
numerator is the sum of distances between all possible pairs 
of elements in the same cluster. By dividing this sum by the 
number of possible pairs, one obtains the average within 
cluster distance. 
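
The second target function can be sketched as the sum of squared Euclidean distances over all pairs of elements that share a cluster, divided by the number of such pairs. The function name and the example clusters are hypothetical.

```python
from itertools import combinations

def average_within_distance(cluster1, cluster2):
    """Average squared Euclidean distance over all same-cluster pairs."""
    total, pairs = 0.0, 0
    for cluster in (cluster1, cluster2):
        for a, b in combinations(cluster, 2):
            total += sum((x - y) ** 2 for x, y in zip(a, b))
            pairs += 1
    return total / pairs

# Hypothetical standardized values: three elements in one cluster, two in the other.
value = average_within_distance([[0, 0], [2, 0], [0, 2]], [[10, 10], [10, 12]])
print(value)
```

The three pairs in the first cluster contribute 4 + 4 + 8 and the single pair in the second contributes 4, giving 20 / 4 = 5. A singleton cluster contributes no pairs, so the sketch fails only if both clusters are singletons.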

Table 3 reveals the results of the three cluster procedures. 
In the hierarchical analysis with linkage between groups, the 
three falsifiers F1, F2 and F4 form cluster 1, falsifier F3 and 
all honest interviewers form cluster 2. Thus, we are able 
to separate both groups of interviewers, except one falsi- 
fier. However, without knowing from the outset which 
interviewers fabricated data and which were honest, one 
would have to decide which of the two clusters contains the 
at risk interviewers. This can be done by comparing the 
means of the indicator variables for each cluster displayed in 
Table 4. For the hierarchical procedure, means of the item- 
non-response ratio and the others ratio are clearly lower in 
cluster 1. The same is true for the mean of the extreme ratio, 
although the difference between the two clusters is less striking. 
Finally, a higher mean of the χ²-value can be observed for 
cluster 1. Given these results, one would, according to the 
above-mentioned hypotheses on the behaviour of falsifiers, 
correctly identify cluster 1 to be the cluster containing the at 
risk interviewers. We also tried to improve the results of the 
hierarchical clustering procedure using the cluster means 
displayed in Table 4 as starting point for the K-means 
analysis. However, the application of K-means clustering 
did not lead to any changes in the cluster composition. 
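
The comparison of cluster means can be sketched as a simple vote over the four indicators. The voting rule is an assumption made for illustration (the text only says that the means are compared); the expected directions follow the hypotheses stated earlier, and the numbers are the hierarchical clustering means from Table 4.

```python
# Expected direction for falsifiers: -1 = lower mean, +1 = higher mean.
EXPECTED_SIGN = {"item_nonresponse": -1, "others": -1, "extreme": -1, "chi2": +1}

def at_risk_cluster(means1, means2):
    """Return 1 or 2 for the cluster that looks more like falsifiers,
    or None if the indicators are evenly split."""
    votes = 0
    for name, sign in EXPECTED_SIGN.items():
        difference = (means1[name] - means2[name]) * sign
        if difference > 0:
            votes += 1      # cluster 1 deviates in the expected direction
        elif difference < 0:
            votes -= 1      # cluster 2 deviates in the expected direction
    if votes == 0:
        return None
    return 1 if votes > 0 else 2

cluster1 = {"item_nonresponse": 0.86, "others": 0.22, "extreme": 42.60, "chi2": 25.55}
cluster2 = {"item_nonresponse": 2.32, "others": 7.64, "extreme": 61.37, "chi2": 12.43}
flagged = at_risk_cluster(cluster1, cluster2)
print(flagged)
```

Returning None for a tie mirrors the situation in which the cluster means do not allow the at risk cluster to be identified.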


Table 3 
Results of the three employed clustering procedures 


Hierarchical clustering 

Interviewer  F1  F2  F3  F4  H1  H2  H3  H4  H5  H6  H7  H8  H9 
Cluster      1   1   2   1   2   2   2   2   2   2   2   2   2 


Distance between clusters divided by distance within clusters 

Interviewer  F1  F2  F3  F4  H1  H2  H3  H4  H5  H6  H7  H8  H9 
Cluster      1   1   2   1   2   2   2   2   2   2   2   2   2 


Distance between elements in one cluster 

Interviewer  F1  F2  F3  F4  H1  H2  H3  H4  H5  H6  H7  H8  H9 
Cluster      1   1   1   1   2   2   2   2   2   2   2   2   1 


Table 4 
Indicator variable means by cluster for the three cluster 
compositions 


          Item-Non-Response  Others         Extreme          χ²-value 

Hierarchical clustering 

Cluster   1      2           1      2       1       2        1       2 
Mean      0.86   2.32        0.22   7.64    42.60   61.37    25.55   12.43 


Distance between clusters divided by distance within clusters 

Cluster   1      2           1      2       1       2        1       2 
Mean      0.86   2.32        0.22   7.64    42.60   61.37    25.55   12.43 


Distance between elements in one cluster 

Cluster   1      2           1      2       1       2        1       2 
Mean      0.68   2.80        0.92   9.06    50.83   60.92    21.43   11.73 


The cluster composition that maximizes Equation (3) is 
identical to the one obtained using hierarchical clustering. 
Consequently, as can be seen from Table 4, the indicator 
means within the two clusters are identical as well. 

The cluster composition minimizing Equation (4) is 
slightly different. Cluster 1 now contains all falsifiers and 
one honest interviewer. The means of the indicator variables 
again clearly indicate cluster 1 to be the cluster containing 
the at risk interviewers. This is a very satisfying result. All 
falsifiers are identified and only one false alarm is produced. 
However, it should be kept in mind that this does not mean 
that this particular cluster method works best when applied 
to another dataset. 

To evaluate to what extent a higher number of indicators 
leads to better results, we repeated our cluster approach 
based on Equations 3 and 4 with all possible combinations 
of indicators, including cases that only rely on one indicator. 
The results (see Table 7 in the appendix) generally indicate 
that an increasing number of indicators improves the results. 
However, there are also combinations with a smaller 
number of indicators that lead to similar results compared to 
those based on all four indicators. Determining which 
indicator composition is the best would require a highly 
subjective choice of the relative cost caused by non-identified 
falsifiers compared to the cost caused by a false 
alarm. But one can determine which indicator compositions 
are not Pareto dominated in the sense that there is no other 
composition that exhibits fewer non-identified falsifiers (false 
alarms) and at the same time not more false alarms (non-identified 
falsifiers). The indicator composition including all 
four indicators is the only one that is not Pareto dominated 
no matter which equation is used. In contrast, compositions 
including only one indicator are Pareto dominated in six out 
of eight cases. 
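
The Pareto criterion used above can be sketched as follows. Each indicator composition is summarized by a pair (undetected falsifiers, false alarms); the labels and the four outcomes below are hypothetical and serve only to illustrate the rule.

```python
def dominates(a, b):
    """True if outcome a is no worse than b on both error counts
    and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def not_pareto_dominated(outcomes):
    """Names of compositions that no other composition dominates."""
    return [
        name for name, result in outcomes.items()
        if not any(dominates(other, result)
                   for other_name, other in outcomes.items()
                   if other_name != name)
    ]

# Hypothetical (undetected falsifiers, false alarms) per indicator composition.
outcomes = {"A": (1, 0), "B": (0, 1), "C": (1, 1), "D": (0, 4)}
efficient = not_pareto_dominated(outcomes)
print(efficient)
```

Compositions A and B cannot be ranked without fixing relative costs, whereas C and D are each beaten on one error count without any gain on the other.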


3.3 Discriminant analysis 


Finally, we turn to the discriminant analysis to check 
whether the hypotheses on falsifiers’ behaviour our cluster 
analysis is based upon are valid. Discriminant analysis can 
be used if the clusters are known in order to assess how well 
the indicators in the analysis can separate the different 
groups and whether group membership can be predicted 
correctly (see Härdle and Simar (2007) for an introduction 
to discriminant analysis). In a linear discriminant analysis, 
the coefficients $b_0$ and $b_i$ of the discriminant function 
$D = b_0 + \sum_{i=1}^{4} b_i x_i$ are determined in such a way that they 
maximize a function that increases with the difference of the 
mean $D$-values of the two different groups and at the same 
time decreases with the differences of the $D$-values of 
elements within the groups. In our case, the $x_i$ are our four 
indicator variables and we obtain two groups by separating 
falsifiers and honest interviewers. 

We use prior probabilities corresponding to the relative 
group size (4/13 and 9/13) in order to predict group mem- 
bership. Table 5 shows the results. Obviously the four vari- 
ables allow a good separation of the falsifiers and the honest 
interviewers, as the group membership is correctly predicted 
in all cases but one. 

As can be seen from Table 5 negative values of the 
discriminant function are associated with the falsifier group. 
Consequently, Table 6 indicates that three of the four 
coefficients’ signs are in line with the expected falsifier 
behaviour. Higher item-non-response and extreme ratios 
lead to a higher probability to observe an honest interviewer 
as does a lower χ²-value. The estimated coefficient for the 
others ratio is negative. Thus an increase in the others ratio 
ceteris paribus raises the probability that an interviewer has 
fabricated data. This might appear as a contradiction to our 
above-mentioned hypotheses. One possible explanation 
might be that the effect of the others ratio is already cap- 
tured by the item-non-response ratio. In fact, the correlation 
coefficient between the two variables is quite high with a 
value of 0.71. The Wilks’ lambda of the discriminant analysis 
is statistically significant at the 5% level. 


Table 5 
Results of the discriminant analysis by interviewer 

Interviewer  Predicted group  Actual group  Discriminant function 
F1           1                1             -2.878 
F2           1                1             -3.376 
F3           2                1             -0.541 
F4           1                1             -1.955 
H1           2                2             1.828 
H2           2                2             1.060 
H3           2                2             1.747 
H4           2                2             1.616 
H5           2                2             0.706 
H6           2                2             0.777 
H7           2                2             -0.041 
H8           2                2             1.765 
H9           2                2             -0.710 

Table 6 
Standardized and non-standardized estimated coefficients 
(discriminant analysis) 


Variable                  Coefficient          Coefficient 
                          (non-standardized)   (standardized) 
Item-Non-Response         0.767                0.917 
Others                    -0.025               -0.129 
Extreme                   0.075                0.821 
χ²-value                  -0.092               -0.562 
Constant                  -4.250               - 
Wilks’ lambda (Prob > F)  0.0254 
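
As a rough check, the discriminant function can be evaluated with the non-standardized coefficients printed in Table 6 and the indicator values of interviewers F1 and H1 from Table 2. The variable names are chosen for this sketch, and small deviations from Table 5 are to be expected because the published coefficients are rounded.

```python
# Non-standardized coefficients as printed in Table 6.
CONSTANT = -4.250
COEFFICIENTS = {"item_nonresponse": 0.767, "others": -0.025,
                "extreme": 0.075, "chi2": -0.092}

def discriminant_value(indicators):
    """D = b_0 + sum of b_i * x_i over the four indicator variables."""
    return CONSTANT + sum(COEFFICIENTS[name] * indicators[name]
                          for name in COEFFICIENTS)

# Indicator values of F1 and H1 as printed in Table 2.
f1 = {"item_nonresponse": 1.36, "others": 0.00, "extreme": 28.33, "chi2": 19.63}
h1 = {"item_nonresponse": 3.85, "others": 18.01, "extreme": 65.12, "chi2": 14.48}

d_f1 = discriminant_value(f1)
d_h1 = discriminant_value(h1)
print(round(d_f1, 3), round(d_h1, 3))
```

The results lie close to the values -2.878 and 1.828 reported in Table 5. Note that the predicted group in Table 5 also depends on the prior probabilities, so the sign of D alone is not the classification rule.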


4. Conclusion 


Survey data are potentially affected by interviewers who 
fabricate data. Data fabrication is a non-negligible problem 
as it can cause severe biases. Even a small amount of 
fraudulent data might seriously impair the results of further 
empirical analysis. We extend previous approaches to 
identify at risk interviewers by combining several indicators 
derived directly from the survey data by means of cluster 
analysis. To demonstrate our approach, we apply it to a 
small dataset which was partially fabricated by falsifiers. The 
fact that we know the falsifiers from the outset allows us to 
evaluate the results of the cluster analysis and to furthermore 
conduct a discriminant analysis to reveal how well the two 
groups of interviewers can be separated by the indicator 
variables. Different types of cluster analyses are conducted. 
All of them lead to the identification of an at risk inter- 
viewer cluster, with the item-non-response ratio and the 
others ratio being the clearest indicators. We are not able to 
identify falsifiers perfectly. However, in all cases the at risk 
interviewer cluster contains a much higher share of falsifiers than 
the second cluster. The advantage of clustering is that one 
obtains a clear classification of interviewers who are at risk 
and the other interviewers, something that is not the case 
when indicators like the χ²-value are examined separately. 
Furthermore, it allows us to combine the information of 
several indicators. By investigating the performance of all 
possible subsets of indicators we find that generally a larger 
number of indicators is more apt to identify falsifiers. The 
fact that different clustering methods lead to different results 
should not necessarily be considered a shortcoming of our 
approach. Depending on how one weights the costs of an 
undetected falsifier relative to a false alarm, one might 
assign only those interviewers to the potential 
falsifier group that always fall into the at risk cluster, no 
matter what clustering method is applied (which would 
imply high costs of false alarms); assign all 
interviewers to the potential falsifier group that fall into the 
at risk cluster at least once (which would imply high costs of 
undetected falsifiers); or choose a solution in between. 
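
The two extreme ways of combining the clustering results can be sketched with set operations. The at risk clusters below follow the results reported in Section 3.2; the text does not name the honest interviewer who joins the falsifiers under Equation (4), so H9 is an inference from the item-non-response and extreme ratio means in Tables 2 and 4.

```python
# At risk cluster under each procedure (H9 under Equation 4 is inferred).
at_risk = {
    "hierarchical": {"F1", "F2", "F4"},
    "equation_3": {"F1", "F2", "F4"},
    "equation_4": {"F1", "F2", "F3", "F4", "H9"},
}

# Strict rule: flagged by every method (suits costly false alarms).
always_at_risk = set.intersection(*at_risk.values())

# Lenient rule: flagged by at least one method (suits costly undetected falsifiers).
ever_at_risk = set.union(*at_risk.values())

print(sorted(always_at_risk), sorted(ever_at_risk))
```

The strict rule flags three interviewers and misses F3; the lenient rule catches all four falsifiers at the price of one false alarm.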

The application to a small dataset demonstrates another 
merit of our approach: it was tested and worked well in a 
situation in which the number of questionnaires per inter- 
viewer was quite limited (three of the falsifiers only sub- 
mitted 10 questionnaires). If a small number of question- 
naires per interviewer is sufficient to perform the analysis, 
one might also think about implementing it during the main 
field period when interviewers have only submitted a certain 
percentage of their questionnaires. Falsifiers could then be 
replaced by other interviewers who survey the units that 
should have been surveyed by the falsifiers. 

Of course, when examining our results one has to keep in 
mind that we applied our method to a dataset in which a 
very severe form of data fabrication occurred: on the one 
hand we have falsifiers that faked all of their questionnaires 
(nearly) completely, on the other hand we have interviewers 
that (presumably) did all of their work honestly, which eases 
the discrimination between honest and dishonest 
interviewers. Furthermore, with 13 interviewers, the size of 
our sample is quite limited. It would be interesting to 
explore the usefulness of our approach applied to larger 
datasets, given that the share of falsified interviews in large 
surveys has been found to be smaller than in our case. 
Additionally, larger datasets might allow the construction of 
additional indicators for the cluster analysis. If the survey 
has a reinterview program it would be possible to evaluate 
the usefulness of our approach by comparing the ‘success’ 
of a random reinterview with the success of a reinterview 
focusing on interviewers that were labeled as being at risk. 
We also intend to pursue the analysis in an experimental 
setting. An appropriate setting can ensure that one obtains a 
dataset which was partly collected by conducting real inter- 
views and partly fabricated by telling some participants in 
the experiment to fill their questionnaires themselves. 


Appendix 


Table 7 
Results of the cluster analyses based on Equations 3 and 4 for all possible cluster combinations 


Indicators Equation 3 Equation 4 

Item-Non-Response Others Extreme χ² Undetected falsifiers False Alarms Undetected falsifiers False Alarms 

0 | ] 

x 1 0) 2 

x 0 1? 0 

X 4 0 4 

x 0 0 2 

X xX 0 0 3 

Xx Xx x i} 0 ! ] 

xX 0! 4 0 4 

x x p) | 0 2 

x xX 3 0 ~ - 

x x x L 0 1 1 

xX x 0! 4 0 4 

xX xX x ] 1 0 2 

x xX x 0! 4 0 4 

x xX Xx x We 0 0! 1 


¹ Indicator composition not Pareto dominated. 


² Mean cluster values did not allow for an identification of the ‘at risk’ cluster. 


The application of graph theory to the 
development and testing of survey instruments 


Steven Elliott ¹ 


Abstract 


This paper focuses on the application of graph theory to the development and testing of survey research instruments. A 
graph-theoretic approach offers several advantages over conventional approaches in the structure and features of a 
specifications system for research instruments, especially for large, computer-assisted instruments. One advantage is to 
verify the connectedness of all components and a second advantage is the ability to simulate an instrument. This approach 
also allows for the generation of measures to describe an instrument such as the number of routes and paths. The concept of 
a ‘basis’ is discussed in the context of software testing. A basis is the smallest set of paths within an instrument which 
covers all link-and-node pairings. These paths may be used as an economical and comprehensive set of test cases for instrument testing. 


Key Words: Graph theory; Computer Assisted Interviewing (CAI); Questionnaire development; Software testing; Basis testing; Test cases. 


1. Introduction 


Graph theory is a branch of mathematics which deals 
with collections of nodes and links. A visual representation 
of a collection of nodes and links is referred to as a ‘graph’. 
Graphs have been used in many areas of study to model 
real-world phenomena. The earliest examples appear in the 
analysis of transportation logistics (Berge 1976, page VII). 
In such analyses, a graph-theoretic approach is useful for 
determining such things as a maximally efficient set of paths 
to cover a number of locations. The locations are repre- 
sented by the nodes of the graph, and the links represent 
routes from one location to another. 

Graph theory has applications also in survey method- 
ology. If the questions in a survey questionnaire are repre- 
sented as nodes and the routes of flow between questions 
are represented as links, then a graph may be used to model 
a questionnaire. As such, many of the theorems and descrip- 
tive measures from graph theory pertain to questionnaires. 
In addition, the processes of documenting and testing survey 
instruments benefit from a graph-theoretic approach. For 
example, a documentation system that contains one table for 
questions and another for response alternatives has the 
ability to verify the connectedness of all instrument compo- 
nents as well as perform simulations of a working instru- 
ment. A testing procedure in which the set of test cases 
minimally spans the ‘basis’ of an instrument graph guar- 
antees that all combinations of consecutive links and nodes 
are tested with the smallest possible number of cases. 
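
The idea of modelling a questionnaire as a graph can be sketched with a small routing table. The five-question instrument below is hypothetical, as are the function names; the sketch assumes that routing never loops back to an earlier question.

```python
from functools import lru_cache

# Hypothetical instrument: each question (node) lists the questions that may
# follow it (links). "END" marks the end of the interview.
ROUTES = {
    "Q1": ["Q2", "Q3"],
    "Q2": ["Q4"],
    "Q3": ["Q4", "END"],
    "Q4": ["END"],
    "END": [],
}

def reachable(start):
    """All nodes that can be reached from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(ROUTES[node])
    return seen

@lru_cache(maxsize=None)
def count_paths(node):
    """Number of distinct paths from `node` to the end of the instrument."""
    if node == "END":
        return 1
    return sum(count_paths(following) for following in ROUTES[node])

all_connected = reachable("Q1") == set(ROUTES)
n_paths = count_paths("Q1")
print(all_connected, n_paths)
```

A question that never appears in the reachable set would indicate a disconnected component, and the path count gives a first impression of how many distinct interviews a full test would have to cover.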

A graph-theoretic representation is not necessary for the 
development, documentation, or testing of most survey 
instruments. In most cases, survey instruments have 
relatively few questions and the routing through an 
instrument does not have many branching points. Examples 
of this are customer satisfaction surveys and short, paper- 
and-pencil surveys such as the U.S. Census. For these types 
of instruments, conventional documentation and _ testing 
procedures are adequate. However, large and complex 
surveys, like many current survey efforts, may benefit from 
a graph-theoretic approach. For example, the Canadian 
Financial Capability Survey (CFCS) is a survey that was 
conducted in 2009 to determine Canadians’ knowledge and 
behavior with respect to financial decision making. It was a 
computer-assisted telephone interview comprising 12 
sections, each of which had approximately 12 questions 
(Statistics Canada 2010). Another example is the Consumer 
Expenditure Surveys Quarterly Interview CAPI Survey 
(2010) conducted by the United States Department of 
Labor, Bureau of Labor Statistics. This survey has 22 
sections most of which have 3 or more subsections, and 
within each subsection there may be as few as six or as 
many as 90 questions (US Bureau of Labor Statistics 
2010). Either of these examples would be a good candidate 
for a graph-theoretic approach to documentation and 
testing. 

This paper addresses the application of mathematical 
graph theory to survey research instruments. Section 2 
describes a questionnaire as a graph and delineates 
the special properties that set a questionnaire graph apart 
from other types of graphs. Section 3 outlines the 
implications of a graph-theoretic representation for the 
structure of databases used in documentation/specifications 
systems for computer-assisted surveys. In Section 4, the 
specific features of graph-theoretic data structures are 
discussed. Sections 5 and 6 pertain to software testing and 
the implications of graph theory for testing. A rationale is 
presented for the use of a ‘basis’ set of test cases which 
covers all pairs of linked nodes. This set of paths constitutes 
a comprehensive set of test cases for instrument testing. 


2. A questionnaire as a graph 


A graph may be represented as follows: $G = (V, E)$, 
where $V = \{v_1, v_2, v_3, \ldots\}$ is a set of nodes or 
vertices and $E = \{(v_i, v_j), (v_j, v_k), \ldots\}$ is a set of links or 
relations between pairs of vertices. Links are referred to as 
‘edges’ in the terminology of graph theory, and hence the 
common usage of “E” to represent them (Chartrand 1985, 
page 27). A graph need not have any additional special 
characteristics. However, graphs which are attributed special 
characteristics are useful in modeling many phenomena in 
science and engineering. For example, graphs with un- 
directed edges (i.e., where both of the nodes attached to a 
link may be a predecessor or successor) may be used to 
model AC electric circuits, and graphs with directed edges 
may be used to model problems in traffic-pattern design. 
Other graphs with special characteristics are utilized to 
model networks in computer science, communications, 
sociology, and psychology. 

In the case of survey questionnaires, the nodes of the 
graph represent different components or parts of a survey 
instrument. Most frequently, these are the substantive 
questions of a survey or decision points where routing is 
determined. The edges represent the response alternatives or 
outcomes associated with a node. Edges also represent the 
routing from one node to the next, and each edge has a 
unique predecessor and successor node. The graph depicted 
in Figure 1 represents a simple, 12-question survey instrument. 
The black circles (i.e., nodes) represent the components 
of the instrument, and the lines connecting the black 
circles represent the edges that join one question to another. 
For example, the first node could represent a question with 
two response alternatives such as ‘yes’ and ‘no’. The second 
node could represent a question with five response alter- 
natives, where the first three alternatives branch to node 3, 
and the fourth and fifth alternatives branch to node 4. 

When a graph is used to represent a questionnaire, there 
are a number of special properties that are attributed to the 
graph. These properties define the logical nature of a ques- 
tionnaire. Bethlehem and Hundepool (2004) pointed out a 
number of these properties. First, a questionnaire has a 
starting node and an ending node. Second, all nodes other 
than the starting and ending nodes are connected. This 
means that for each node in the graph there is at least one 
route to it from the starting node, and one route away from it 
to the ending node. A third property of a questionnaire 
graph is that each of the edges is directed. This means that 
the route of flow from one node to another is always in one 
direction. A fourth characteristic of a questionnaire graph is 
that it may have multiple edges between a single pair of 
nodes. Many types of graphs are restricted such that only 
one edge may join a pair of nodes. This restriction does not 
apply to a questionnaire graph, because questionnaires 
commonly have more than one response alternative leading 
from one question to another. A final characteristic is that 
looping structures are permitted. This means that a node 
may appear multiple times on a single route. Looping struc- 
tures are used frequently in questionnaires to modify re- 
sponses that are determined to be incorrect. For example, 
financial or time-usage questions may be checked with edits 
that loop back if component questions do not sum to the 
correct total. 


Figure 1 Representation of a Survey Instrument as a Graph 


The characteristics of a questionnaire graph may be 
summarized as follows: 

1. a starting node and an ending node, 

2. connectedness (i.e., each node is connected to the 
start and end nodes), 

3. all edges are directed, 

4. pairs of nodes may have multiple or parallel edges 
connecting them, and 

5. nodes may appear more than once on a route. 


Given a set of defining properties, it is possible to determine 
a number of descriptors including the number of routes and 
a basis. It is possible also to model a documentation system 
on the structure of the graph as illustrated in the next 
section. 
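
The connectedness property listed above lends itself to a mechanical check. The following Python fragment is a minimal sketch, not the author's implementation: the toy edge list, the node numbers, and the function names are all hypothetical. It assumes a questionnaire is stored as a list of directed edges (predecessor, successor, response label), with parallel edges allowed, and verifies that every node can be reached from the starting node and can reach the ending node.

```python
from collections import defaultdict

# Hypothetical toy questionnaire: each edge is (predecessor, successor, label).
# Parallel edges between the same pair of nodes are allowed (property 4).
EDGES = [
    (1, 2, "yes"), (1, 2, "no"),
    (2, 3, "a"), (2, 3, "b"), (2, 4, "c"),
    (3, 5, "continue"), (4, 5, "continue"),
]
START, END = 1, 5


def reachable(start, adjacency):
    """Return the set of nodes reachable from start by following adjacency."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_connected_questionnaire(edges, start, end):
    """Check property 2: every node is reachable from start and can reach end."""
    forward, backward = defaultdict(set), defaultdict(set)
    nodes = {start, end}
    for pred, succ, _label in edges:
        forward[pred].add(succ)    # follow edges in their stated direction
        backward[succ].add(pred)   # follow edges against their direction
        nodes.update((pred, succ))
    from_start = reachable(start, forward)
    to_end = reachable(end, backward)
    return nodes <= from_start and nodes <= to_end


print(is_connected_questionnaire(EDGES, START, END))
```

Because the search keeps a set of visited nodes, the check also terminates when the graph contains looping structures.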


3. Documentation and specification 
systems for survey questionnaires 


Questionnaire documentation systems are typically one 
of two types: a text document or a relational database. For 
text-document systems, the information pertaining to a 
substantive question or other type of instrument component 
is most often presented as a section of the document. It 
consists of the question text, response alternatives, routing, 
and instructions for programmers. The documentation 
system itself has no functionality aside from the search and 
print capabilities available in the word-processing software 
used to create the documentation. Systems using a relational 
database, on the other hand, are typically structured as a 
table where the rows represent the questions of the survey, 
and the columns represent attributes of the questions. Each 
record in the table is an n-tuple of question attributes. For 
example, the attributes of a question might include: name, 
sequence number, text of the question, response alternatives, 
routing information, and technical notes. One such speci- 
fications system is the Tool for the Analysis and Documen- 
tation of Electronic Questionnaires (TADEQ) (Bethlehem 
and Hundepool 2004). Other examples include systems 
developed at Westat Inc. for the Medicare Current Benefi- 
ciary Survey (MCBS) sponsored by the US Centers for 
Medicare and Medicaid Services (Medicare Current Benefi- 
ciary Survey: Overview 2010) and the Medical Expenditure 
Panel Survey (MEPS) sponsored by the US Department of 
Health and Human Services (MEPS: Survey Instruments 
and Associated Documentation 2010). These database 
systems have in common a structure of one primary table 
where each record represents a question. 

Despite the advantages afforded by the straightforward 
nature of conventional systems, a specification system 
modeled like a graph has capabilities beyond those possible 
with a conventional structure. Before describing those 
capabilities and the necessary underlying structure, it should 
be noted that there are multiple ways in which a graph- 
theoretic data structure may be constructed (the interested 
reader is referred to Gibbons (1985, page 73) who described 
and categorized a number of those structures). The system 
proposed here is a relational list structure with two primary 
tables. One table represents the nodes of the graph, and the 
second table represents the edges. In the table representing 
nodes, each record or row represents an individual instru- 
ment component (i.e., survey question, edit, or routing 
decision point). The second table represents edges where 
each record represents an individual edge (i.e., a response 
alternative or a specific condition existing at a decision 
point). Each record from either table contains attributes 
associated with the record. Individual attributes are con- 
tained in the columns of the table. In the table of nodes, each 
column represents a specific attribute such as the component 
ID and component type. In the table of edges, each column 
represents an attribute such as the text of a response alter- 
native. Two important distinctions between a documentation 
system with this structure versus a more conventional docu- 
mentation system are: 1) the information pertaining to edges 
is not contained in the table for instrument components and 
2) the table of edges (i.e., links) contains identifiers for the 
predecessor and successor of an edge. As described in the 
next section, these distinctions allow a documentation 
system to perform in ways not possible with conventional 
systems. 
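
A two-table structure of this kind might be sketched as follows, using Python's built-in sqlite3 module. The table names, column names, component IDs, and response texts are illustrative assumptions rather than the schema of any system described here; the essential point is only that the edges table carries its own predecessor and successor identifiers.

```python
import sqlite3

# In-memory database for illustration; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    component_id   TEXT PRIMARY KEY,   -- e.g., a question or decision point
    component_type TEXT NOT NULL,      -- e.g., 'question', 'edit', 'routing'
    question_text  TEXT
);
CREATE TABLE edges (
    edge_id        INTEGER PRIMARY KEY,
    predecessor_id TEXT NOT NULL REFERENCES nodes(component_id),
    successor_id   TEXT NOT NULL REFERENCES nodes(component_id),
    response_text  TEXT                -- response alternative or condition
);
""")

conn.executemany(
    "INSERT INTO nodes (component_id, component_type, question_text) "
    "VALUES (?, ?, ?)",
    [
        ("Q1", "question", "Hypothetical question 1"),
        ("Q2", "question", "Hypothetical question 2"),
        ("Q3", "question", "Hypothetical question 3"),
    ],
)
conn.executemany(
    "INSERT INTO edges (predecessor_id, successor_id, response_text) "
    "VALUES (?, ?, ?)",
    [("Q1", "Q2", "yes"), ("Q1", "Q3", "no"), ("Q2", "Q3", "any response")],
)

n_nodes = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
n_edges = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
print(n_nodes, n_edges)
```

Keeping response alternatives out of the nodes table means that a question with five alternatives is represented by one node record and five edge records.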


4. Features of a graph-based specifications system 


The use of separate tables for nodes and links as the 
building blocks of a specifications system has several 
advantages. Most important of these advantages is the 
ability to simulate an interview. A developer or tester can 
move through an instrument selecting response alternatives 
while being routed from one instrument component to 
another just as if they were administering the instrument to a 
respondent. Figure 2 is an example of a screen display for 
simulating an instrument. The component from which 
simulation begins is selected from this screen. Figure 3 is 
the actual simulation screen itself. It shows the current 
component with the question text or conditional in the 
center of the screen. The lower left is a display of all 
components from which one may have come in order to 
arrive at the current component (i.e., predecessors). These 
are referred to as ‘origination points’ in the screen display. 
The lower right is a display of destination points or compo- 
nents to which one may go from the current component (i.e., 
successors). Thus, one may move through an instrument one 
component at a time in either direction by selecting either an 
origination point or a destination point. In Figures 2 and 3, 
the questionnaire used as an example is one on general 
knowledge about cancer, and the question depicted in 
Figure 3 has only one predecessor and one successor. This 
will be the case for most survey questions; however, if 
multiple predecessors or successors did exist, they would be 
listed in the display. 

The ability to simulate the operation of a survey instru- 
ment is made possible because a separate table is utilized for 
links. This table may be queried to find all predecessors and 
successors for any component in the questionnaire. During 
the design phase of development, this feature can be used to 
ensure that all sections and questions are properly connected 
and all routing is correct. In the testing phase of devel- 
opment, this feature may be used to perform side-by-side 
comparisons of an instrument and the specifications upon 
which it was built. A tester could have the specifications 
system simulating the instrument on one monitor while 
running the actual instrument on a second. Such compari- 
sons can be used to check not only the wording and format- 
ting of questions and response alternatives, but also to verify 
that the instrument is going to the appropriate question at the 
appropriate time. Reports of errors or problems may then be 
entered directly into the specification system as an attribute 
of an instrument component. 
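
The predecessor and successor queries that underlie simulation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system shown in the figures: the component IDs, the response labels, and the function names are hypothetical, and the links table is held as a simple list of (predecessor, successor, response) records.

```python
# Hypothetical links table: (predecessor, successor, response alternative).
LINKS = [
    ("Q1", "Q2", "yes"),
    ("Q1", "Q3", "no"),
    ("Q2", "Q4", "any"),
    ("Q3", "Q4", "any"),
]


def successors(component, links):
    """Destination points: components reachable in one step from component."""
    return sorted({succ for pred, succ, _resp in links if pred == component})


def predecessors(component, links):
    """Origination points: components from which component can be reached."""
    return sorted({pred for pred, succ, _resp in links if succ == component})


def simulate(start, responses, links):
    """Walk the instrument from start, following one response per component."""
    visited, current = [start], start
    for response in responses:
        matches = [s for p, s, r in links if p == current and r == response]
        if len(matches) != 1:
            raise ValueError(f"no unique link for {response!r} at {current}")
        current = matches[0]
        visited.append(current)
    return visited


print(successors("Q1", LINKS))
print(predecessors("Q4", LINKS))
print(simulate("Q1", ["no", "any"], LINKS))
```

A failed lookup in simulate corresponds to a routing problem in the specifications: a response alternative that leads nowhere, or one that leads to more than one component.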

[Screen display: ‘Simulate the Instrument’ dialog with fields for Project (General Cancer Knowledge), Instrument (Instrument1), and Component ID.] 

Figure 2 Begin simulation screen 


[Screen display: simulation of General Cancer Knowledge, Instrument1, component CK-3 (type: Field). Question text: ‘CK-3. And which of the four remaining illnesses causes the second greatest number of deaths? [NOTE: Display the four response alternatives not selected in the previous question in the same order as presented in the previous question.]’ Origination point: Instrument 1, CK-2. Destination point: Instrument 1, CK-4a. Buttons: Go To Origination, Go To Destination, View Response Alternatives.] 

Figure 3 Simulation screen 

Another method for evaluating the integrity of a questionnaire 
is to identify ‘orphan’ instrument components. 
Sometimes in the course of creating or modifying a ques- 
tionnaire, an instrument component may become inac- 
cessible. Such components are referred to as ‘orphans’. 
Since a table exists for links (i.e., response alternatives and 
conditions), it is possible to run queries on this table to 
determine if a particular question appears as the successor to 
any link. If the question does not appear as a successor, then 
it is an orphan. Figure 4 contains the screen display for a 
listing of instrument components sorted by the frequency 
with which each appears as a successor. This is called an 
‘Orphan Report’ in the figure. It shows that the first question 
in the survey has no origination points. This is as it should 
be since the first component cannot have predecessors. Any 
other component having zero origination points is an 
orphan. The orphan report is useful also in characterizing 
instrument components. For example, a question or compo- 
nent with a large number of originations may be the first 
question of a section devoted to handling premature 
terminations. Such a section is accessible from any other 
section of the interview, and therefore it would have a large 
number of predecessors. 
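
An orphan report of this kind reduces to counting how often each component appears as a successor. The sketch below is illustrative only: the component IDs (including the deliberately unreachable ‘QX’) and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical component list and links table (predecessor, successor).
COMPONENTS = ["Q1", "Q2", "Q3", "Q4", "QX"]
LINKS = [("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q4"), ("Q3", "Q4")]


def orphan_report(components, links, first_component):
    """Return (report, orphans).

    report lists (number of origination points, component), sorted by count;
    orphans are components other than the first that are never a successor.
    """
    counts = Counter(succ for _pred, succ in links)
    report = sorted((counts[c], c) for c in components)
    orphans = [c for n, c in report if n == 0 and c != first_component]
    return report, orphans


report, orphans = orphan_report(COMPONENTS, LINKS, first_component="Q1")
for n_origins, component in report:
    print(component, n_origins)
print("orphans:", orphans)
```

The first component is excluded from the orphan list because it cannot have predecessors; components at the bottom of the sorted report are the ones with many origination points.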


5. Testing 


Testing a computer-assisted survey instrument is the 
process of verifying that the behavior of the instrument is 
consistent with the design specifications. Several ap- 
proaches have been utilized to accomplish this. One is to 
test first the building block components of a system, and 
then move to increasingly larger and more integrated assemblages 
of components (i.e., ‘bottom-up’ testing). Testing the 
building block components is referred to as ‘unit testing’ 
(Beizer 1995, page 5). After each of the building blocks has 
been tested separately, the blocks are assembled, and testing 
is concentrated on how the components interact. This is 
referred to as ‘integration testing’ (Hetzel 1984, page 11). 
The final stage of integration testing is ‘system testing’, 
where the entire system is tested as a whole just as it would be used 
in a true production environment (Myers 1979, page 110). 

Other approaches and terminology have also been ap- 
plied to testing procedures. These include ‘black-box’, 
‘white-box’, and ‘regression’ testing. In black-box testing, a 
program is treated as if it were in a black box where the 
inner workings are not visible. Inputs and outputs are the only 
observable aspects of program function (Beizer 1995, page 
8). White-box testing utilizes knowledge of the program 
code to decide how to conduct the tests and which cases are 
used in testing (Patton 2006, page 55). For example, a 
programmer might conduct a series of white-box tests such 
that every line of code is ‘exercised’ (i.e., ‘code coverage’) 
or such that every branching point is exercised (i.e., ‘branch 
coverage’). Regression testing is used to ensure code integrity 
after changes or additions have been introduced to an 
operational program (Beizer 1995, page 235). Regression 
tests utilize a set of test cases. This set is selected such that 
each of the major branches of the program is exercised. 
Other types of testing (e.g., alpha, beta, usability) are also 
used in software development, and there are many sources 
for a more comprehensive description of testing procedures 
(see Kaner, Falk and Nguyen 1999, page 277). 


[Screen display: Orphan Report for General Cancer Knowledge, listing the components of Instrument 1 (beginning with CK-1) by sequence and by the number of times each appears as a successor.] 

Figure 4 Orphan report 

In any testing procedure, a major concern is testing bias. 
This results when some components or functionality of an 
instrument are excluded from testing. For example, ques- 
tions which appear toward the end of a survey or in an 
obscure section may be more likely to be excluded. Testing 
bias is eliminated completely if a set of test cases is selected 
such that all instrument components, links between compo- 
nents, and aspects of functionality are included. However, 
given the length and complexity of some surveys, compre- 
hensive testing is not a practical option. Consider, for 
example, the questionnaire represented in Figure 1. This 
questionnaire has only 12 questions and 28 response 
alternatives, and yet, there are 672 possible routes through 
the instrument. In large surveys such as those mentioned 
above, the number of routes could be well over 10,000. 
Thus, if comprehensive coverage is not a viable approach 
for large surveys, it is possible to avoid testing bias by 
taking a probability sample of potential test cases. A graph- 
theoretic approach can be useful in both the specification of 
the universe of test cases and in the determination of a 
rational approach to sampling test cases. 


6. A graph-theoretic approach to testing 


A universe of test elements can be defined in several 
different ways. One could use the elements already dis- 
cussed - test cases, where each case is a mock interview. 
Alternatively, a universe of test elements could be survey 
questions, response alternatives, or any of a variety of 
combinations of questions and response alternatives. The 
discussion here is limited to test cases, and therefore, it will 
be helpful to provide precise definitions of a test case and 
two closely related terms, ‘path’ and ‘route’. 

A path is a unique, ordered set of nodes, which traverses 
an instrument from beginning to end. Each node in a given 
path, provided that it is not a starting or ending node, is 
linked to a predecessor and a successor (this definition is 
consistent with Bethlehem and Hundepool 2004). A unique 
path results whenever a component has more than one 
successor component. In Figure 1, multiple successors 
appear for components 2 and 4. These two branching nodes 
result in three paths: 


Pathyl- 1) 2y3,5,7, Dll 
Ball 2m lee sone Os Lt 
Path 3 - 1, 2,4, 6, 10, 11, 12 


A ‘route’, on the other hand, is a unique, alternating 
series of nodes and links beginning with the starting node 
and terminating with the ending node. Like a path, a route 
must satisfy the properties of connectedness and direction. 
‘Route’ is the graph-theoretic term which is synonymous 
with what is commonly called a ‘test case’ in software 
testing. Since a route takes into account which link connects 
a pair of nodes, the number of routes in a graph is greater 
than or equal to the number of paths. The number of routes 
contained within a particular path is equal to the product of 
the number of links between each pair of nodes along the 
path. Thus for the example in Figure 1, the number of routes 
for each path is: 


Path 1 - 2 x 3 x 2 x 2 x 2 x 2 x 3 = 288 
Path 2 - 2 x 3 x 2 x 2 x 2 x 2 x 3 = 288 
Path 3 - 2 x 2 x 2 x 2 x 2 x 3 = 96 


The total number of routes is the sum of routes over all 
paths (i.e., 288 + 288 + 96 = 672). A formula for computing 
the number of routes is: 


$$\text{Routes} = \sum_{i=1}^{P} \prod_{j=1}^{NP_i} \text{links}_{ij},$$ 

where $i$ represents the $i$th path, $P$ represents the total 
number of paths, $j$ represents the $j$th set of links on a 
given path, $NP_i$ represents the number of pairs of connected 
nodes on the $i$th path, and $\text{links}_{ij}$ represents the number of 
links connecting the $j$th pair of nodes on the $i$th path. 
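
The route count can be checked numerically. The short Python sketch below takes the per-pair link counts as printed in the worked example for Figure 1 and applies the sum-of-products rule; the dictionary and variable names are illustrative.

```python
from math import prod

# Number of links between successive node pairs on each path of Figure 1,
# as printed in the worked example above.
LINK_COUNTS = {
    "Path 1": [2, 3, 2, 2, 2, 2, 3],
    "Path 2": [2, 3, 2, 2, 2, 2, 3],
    "Path 3": [2, 2, 2, 2, 2, 3],
}

# Routes on one path: product of the link counts along that path.
routes_per_path = {name: prod(counts) for name, counts in LINK_COUNTS.items()}

# Routes in the graph: sum of the per-path products.
total_routes = sum(routes_per_path.values())

print(routes_per_path)
print(total_routes)  # 672
```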

If a testing protocol is based on a sample of routes, then a 
minimum and comprehensive suite or universe of test cases 
is contained in the ‘basis’ of a graph. The term ‘basis’ in 
this context is analogous to a ‘basis’ in geometry. The basis 
of a geometric space is a set of vectors which is sufficient to 
span the space, or in other words, a basis is a set of vectors 
sufficient to locate any point in the space. Likewise, the 
basis of a graph is a set of paths sufficient to include all 
predecessor-successor pairings of nodes. This implies that 
all nodes and at least one of the links between any 
connected pair of nodes are included. A basis is a subset of 
all possible paths. All questionnaires have a set of paths (P) 
in which each member satisfies the definition of a path as 
stated above (i.e., a unique sequence of nodes). Within this 
set is a subset which has the special characteristic that each 
member path contains at least one pair of connected nodes 
that is not contained in any other path within the subset. 
This subset will be referred to as ‘basis paths’ (BP). 

In order to gain a better understanding of the difference 
between the paths in BP and those in the complement of BP 
(i.e., P — BP), consider the graph presented in Figure 5. The 
set of all paths (P) for the graph in Figure 5 is: 


Path 1 - 1, 2, 4, 5, 7 
Path 2 - 1, 2, 4, 6, 7 
Path 3 - 1, 3, 4, 5, 7 
Path 4 - 1, 3, 4, 6, 7 


Any one of the four paths could be eliminated and the 
remaining three would include each pair of connected 
nodes, and therefore any three constitute a set of basis paths 
(BP). For example, if Path 1 were eliminated, each of the 
node pairings would still be contained in Paths 2, 3, and 4. 
However, if both Paths 1 and 2 were eliminated, then node 
pairings 1 - 2 and 2 - 4 would be excluded. Thus, the set of 
two paths would be insufficient to span all of the independent 
sequences of nodes in the graph. 
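
The Figure 5 example can be reproduced with a short sketch. The successor lists below are inferred from the four paths listed for Figure 5, and the greedy selection shown is one simple way of finding a covering set of paths; it is not the author's algorithm, it assumes a graph without loops, and it is not guaranteed to return the smallest possible set for every graph. All function names are hypothetical.

```python
# Successor lists implied by the four paths listed for Figure 5.
SUCCESSORS = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}


def all_paths(successors, node, end, prefix=()):
    """Enumerate every node sequence from node to end (acyclic graphs only)."""
    prefix = prefix + (node,)
    if node == end:
        return [prefix]
    paths = []
    for nxt in successors[node]:
        paths.extend(all_paths(successors, nxt, end, prefix))
    return paths


def greedy_basis(paths):
    """Keep each path that contributes at least one node pair not yet covered."""
    covered, basis = set(), []
    for path in paths:
        pairs = set(zip(path, path[1:]))
        if not pairs <= covered:
            basis.append(path)
            covered |= pairs
    return basis


paths = all_paths(SUCCESSORS, 1, 7)
basis = greedy_basis(paths)
print(len(paths), len(basis))  # 4 3
for path in basis:
    print(path)
```

With the paths taken in the order listed, the first three are retained and the fourth adds no new node pairing, which agrees with the observation that any three of the four paths suffice.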


Figure 5 Representation of paths and basis paths 


As illustrated above in Figure 1, many questionnaires 
encountered in practice have so many routes that testing all 
routes is not practical. Further, a typical route within an 
instrument has one or more similar routes which involve the 
same set of nodes, and these routes may be so similar that 
they differ by only a single, parallel link. Therefore, testing 
all routes would be not only impractical due to the large 
number of routes, but also redundant due to the similarity of 
many routes. The task for a test designer is to select a subset 
of routes that maximizes coverage and minimizes redun- 
dancy. This may be accomplished by using BP as a first step 
in sampling from the universe of routes. The utilization of 
BP in this manner is equivalent to beginning the sampling 
process with a purposive sample (Cochran 1977, page 10). 
Another way to think of this first step is as a redefinition of 
the universe of elements for the purpose of eliminating 
redundancy. This universe is comprehensive in its coverage, 
and it contains the smallest set of cases necessary to include 
all connected node pairs. A second stage of sampling could 
then be to select one or more routes from each of the paths 
contained in BP. This could be accomplished in several 
ways. One way would be to consider each path as a cluster 
of test cases and then take a probability sample from each 
cluster. Another way would be to select one route from each 
cluster by randomly selecting one parallel link at each node. 
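
The second of these options, selecting one parallel link at random for each node pairing on a basis path, might be sketched as follows. The path, the link counts, the seed, and the function name are hypothetical values chosen for illustration.

```python
import random

# Hypothetical basis path and number of parallel links for each node pairing.
PATH = (1, 2, 4, 5, 7)
LINK_COUNTS = {(1, 2): 2, (2, 4): 3, (4, 5): 2, (5, 7): 2}


def sample_route(path, link_counts, rng):
    """Return one route: for each node pairing, a randomly chosen link index."""
    route = []
    for pred, succ in zip(path, path[1:]):
        n_links = link_counts[(pred, succ)]
        # Each parallel link between pred and succ is equally likely.
        route.append((pred, succ, rng.randrange(n_links)))
    return route


rng = random.Random(2010)  # fixed seed so that a test case can be regenerated
route = sample_route(PATH, LINK_COUNTS, rng)
print(route)
```

Selecting each link independently and with equal probability gives every route within the path the same chance of selection.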

If one accepts the notion of basis testing, then it must be 
determined how much of the basis should be tested. If all 
paths in BP are tested, then the only elements of an instru- 
ment excluded from testing are redundant links. While 
redundant links may contain spelling or formatting errors, 
they are unlikely to contain routing errors. This stems from 
the nature of the programming task involved in creating 
CAI instruments. Response alternatives are typically ‘bundled’ 
in the sense that alternatives which lead to the same 
next question are likely to be either all misdirected or none 
misdirected. For this reason, comprehensive testing of a 
basis is an effective method for minimizing errors of a type 
most likely to lead to loss of data. 

On the other hand, non-comprehensive testing may be 
the only reasonable strategy if constraints due to time or 
level of effort exist and the number of paths in a basis is 
large. Despite the fact that any part of an instrument not 
tested may contain an error, any fraction of the paths in a 
basis may constitute an unbiased test. Thus, the percentage 
of paths to be included in a test should probably depend on 
factors specific to a particular development situation. For 
example, an instrument may contain modules which have 
been used previously or modules that have had only minor 
modification since previous use. These modules need not be 
tested as thoroughly as newer ones. As a general rule, a 
minimum sample of test cases should include each distinct 
section of an instrument in one or more paths, and paths 
should be included to cover all inter-sectional connections. 


7. Discussion and conclusions 


A graph-theoretic approach to software development has 
two major advantages over conventional approaches. First, 
it allows for a documentation system that can simulate the 
behavior of a computer-assisted interview. This is useful in 
verification of routing and as an aid to testers in side-by-side 
comparisons of instrument behavior versus design specifi- 
cations. The second major advantage is in selecting cases 
for testing. The use of the basis of a questionnaire allows for 
the specification of a universe of test cases which covers all 
node pairings with a minimum number of paths. Probability 
sampling from this universe insures that no bias is incorpo- 
rated into the testing procedures. 

In practice, the first advantage can be achieved by 
structuring the database behind a specifications system 
such that it contains a table for nodes and a table for links. 
If the links table specifies a predecessor and a successor 
node, then queries of the tables will provide the func- 
tionality for verification of routing and simulation. The 
second advantage can be achieved with an algorithm for 
the identification of a basis. As pointed out by Poole 
(1995), one of the most important things to do when 
setting out to test software is to determine which test cases 
to use. He presented an algorithm for doing this that is 
based on the flowgraph of a program. Using a flowgraph 
for this purpose is useful as long as the program is not too 
large. With large and complicated programs, flow diagrams 
become unwieldy. The same is true of large and complicated 
questionnaires (Bethlehem and Hundepool 2004). 
The appendix contains output from an algorithm which 
generates a basis, counts routes, and specifies basis paths 
for an example questionnaire graph. The algorithm used to 
generate the output appearing in the appendix is available 
from the author (sdelliott2@verizon.net). This algorithm 
does not handle looping structures as would be inherent in 
edits or ‘go back’ features. These structures may be tested 
separately from the questionnaire graph. An algorithm which 
handles looping is under development. 

A graph-theoretic approach is valuable also in that it 
allows for the use of a number of descriptive measures of 
questionnaires such as the number of routes, the number of 
paths, cyclomatic complexity, and several types of 
descriptive matrices (see appendix). Cyclomatic complexity is a 
measure of complexity in software code (see Hetzel 1984; 
McCabe 1976; and Watson and McCabe 1996). It is equal 
also to the number of paths in the basis of a graph. For 
directed graphs where parallel links are not permitted, 
cyclomatic complexity (CC) = L - N + 2, where L is 
the number of links and N is the number of nodes. Future 
enhancements to a graph-theoretic approach will likely 
involve such things as: 1) taxonomies for components, links, 
and errors; 2) secondary tables in the specification database 
containing attributes specific to different types of nodes and 
links; 3) sophisticated sampling plans for selecting test 
cases; and 4) purposive route sampling. 
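
As a worked check of the cyclomatic complexity formula, the sketch below uses the branch counts for the 16-node example graph as read from Table 1 of the appendix. The variable and function names are illustrative.

```python
# Number of branches (non-redundant outgoing links) for nodes 1 to 16,
# as read from Table 1 of the appendix.
BRANCHES = [1, 3, 1, 2, 2, 1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 0]


def cyclomatic_complexity(n_links, n_nodes):
    """CC = L - N + 2 for a directed graph without parallel links."""
    return n_links - n_nodes + 2


n_nodes = len(BRANCHES)   # 16 nodes
n_links = sum(BRANCHES)   # 23 links, excluding redundant (parallel) links
cc = cyclomatic_complexity(n_links, n_nodes)
print(n_links, n_nodes, cc)  # 23 16 9
```

The result of 9 agrees with the nine basis paths listed in Table 5 of the appendix.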

Taxonomies will promote the specification of special 
types of instrument components and the incorporation of 
secondary tables in the documentation system. An example 
of a special type of instrument component is one with a 
randomization feature. Such a component would be used in 
multi-phase respondent selection where a respondent re- 
porting a particular disease, for example, has an increased 
probability of being routed to a follow-up section pertaining 
to that disease. In this case, the initial question pertaining to 
the disease may be a special type called ‘respondent 
selection’. A secondary table in the documentation system 
for ‘respondent selection’ questions may have attributes 
pertaining to a random number generator such as generator 
seed and selection threshold. 

Enhancements to sampling may include stratified sam- 
pling (Cochran 1977, page 89) and sampling with probabi- 
lity proportional to size (i.e., PPS). Stratified sampling could 
be used to insure that all sections within a questionnaire are 
included with certainty. Paths would be stratified according 
to the sections they traverse. With PPS sampling, size might 
be a measure of path length, and the probability of selection 
for a particular path would be dependent on the number of 
nodes included in the path. Thus, longer paths could be 
included with greater frequency. Purposive route sampling 
may be utilized for testing instrument characteristics other 
than programming errors. For example, later phases of 
questionnaire development might target specific sequences 
of questions for tests of the cognitive characteristics of an 
instrument. 
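
A PPS selection of paths, with path length as the size measure, could be sketched as below. The example paths and function names are hypothetical, and for simplicity the draw is made with replacement, which is only one of several possible PPS designs.

```python
import random

# Hypothetical basis paths of differing lengths.
PATHS = [(1, 2, 3, 7), (1, 2, 4, 5, 7), (1, 2, 4, 5, 6, 7)]


def selection_probabilities(paths):
    """Single-draw selection probability proportional to the number of nodes."""
    total = sum(len(path) for path in paths)
    return [len(path) / total for path in paths]


def pps_sample(paths, k, rng):
    """Draw k paths with replacement, weighting each path by its node count."""
    weights = [len(path) for path in paths]
    return rng.choices(paths, weights=weights, k=k)


probs = selection_probabilities(PATHS)
sample = pps_sample(PATHS, k=5, rng=random.Random(2010))
print(probs)
print(sample)
```

Under this weighting the six-node path is one and a half times as likely to be drawn as the four-node path on any single draw.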

Other researchers in this area likely will provide further 
enhancements to the application of graph theory to question- 
naire development. It does seem clear that graph theory 
lends itself well to the description, development, and testing 
of complex CAI instruments. The current trends in CAI 
usage seem to be in the direction of more sophisticated and 
larger instruments. For this reason, tools which help to 
document instrument components and identify errors are 
valuable to development efforts. 

Appendix 


Example of Basis Generation 


Bold numbers are nodes. Non-bold numbers 
represent the number of links connecting a pair 
of nodes. 


Links (i.e., excluding redundant links) = 23 
Nodes = 16 


Figure 6 Questionnaire graph 

Table 1 
Branches count for each node 

Node Number          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
Number of Branches   1  3  1  2  2  1  2  4  1  1  1  1  1  1  1  0 

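

Table 2
Link matrix

Node   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  1    -  2  -  -  -  -  -  -  -  -  -  -  -  -  -  -
  2    -  -  2  2  4  -  -  -  -  -  -  -  -  -  -  -
  3    -  -  -  -  -  3  -  -  -  -  -  -  -  -  -  -
  4    -  -  -  -  -  4  2  -  -  -  -  -  -  -  -  -
  5    -  -  -  -  -  -  3  2  -  -  -  -  -  -  -  -
  6    -  -  -  -  -  -  -  -  4  -  -  -  -  -  -  -
  7    -  -  -  -  -  -  -  -  -  5  4  -  -  -  -  -
  8    -  -  -  -  -  -  -  -  -  -  -  4  2  2  2  -
  9    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  2
 10    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  3
 11    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  3
 12    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  2
 13    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  4
 14    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  4
 15    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  4
 16    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -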


Each cell contains a value for the number of links between the row and column nodes. 


Table 3
Path matrix

           1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
           node   node   node   node   node   node   node   node   node   node
Path 1       1      2      3      6      9     16
Path 2       1      2      4      6      9     16
Path 3       1      2      5      7     10     16
Path 4       1      2      4      7     10     16
Path 5       1      2      5      8     12     16
Path 6       1      2      5      7     11     16
Path 7       1      2      4      7     11     16
Path 8       1      2      5      8     13     16
Path 9       1      2      5      8     14     16
Path 10      1      2      5      8     15     16


Cell values represent nodes. Each row represents a path.
[Note: The paths in this example all have 6 nodes. However, in general, all paths will not have the same number of nodes.]


Table 4
Link counts and number of routes for each path

                        Node Pairings                         Routes
           1st to 2nd  2nd to 3rd  3rd to 4th  4th to 5th  5th to 6th
Path 1          2           2           3           4           2         96
Path 2          2           2           4           4           2        128
Path 3          2           4           3           5           3        360
Path 4          2           2           2           5           3        120
Path 5          2           4           2           4           2        128
Path 6          2           4           3           4           3        288
Path 7          2           2           2           4           3         96
Path 8          2           4           2           2           4        128
Path 9          2           4           2           2           4        128
Path 10         2           4           2           2           4        128


Paths = 10 Total Routes = 1,600
Cells represent the number of links between successive nodes in a path.


Table 5
Basis path matrix

                 1st    2nd    3rd    4th    5th    6th    7th    8th    9th    10th
                 node   node   node   node   node   node   node   node   node   node
Basis Path 1       1      2      3      6      9     16
Basis Path 2       1      2      4      6      9     16
Basis Path 3       1      2      5      7     10     16
Basis Path 4       1      2      4      7     10     16
Basis Path 5       1      2      5      8     12     16
Basis Path 6       1      2      5      7     11     16
Basis Path 7       1      2      5      8     13     16
Basis Path 8       1      2      5      8     14     16
Basis Path 9       1      2      5      8     15     16

Cell values represent nodes. Each row represents a basis path.


Table 6
Link counts and number of routes for each basis path

                             Node Pairings                         Routes
                1st to 2nd  2nd to 3rd  3rd to 4th  4th to 5th  5th to 6th
Basis Path 1         2           2           3           4           2         96
Basis Path 2         2           2           4           4           2        128
Basis Path 3         2           4           3           5           3        360
Basis Path 4         2           2           2           5           3        120
Basis Path 5         2           4           2           4           2        128
Basis Path 6         2           4           3           4           3        288
Basis Path 7         2           4           2           2           4        128
Basis Path 8         2           4           2           2           4        128
Basis Path 9         2           4           2           2           4        128


Basis Paths = 9 Total Routes in Basis = 1,504


Cells represent the number of links between successive nodes from the Basis Paths Matrix above.
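

The route counts in the appendix can be checked with a short Python sketch. It assumes the link counts between node pairs that are implied by the rows of Table 6; the function names are illustrative only. Each path is enumerated by a depth-first walk from node 1 to node 16, and the number of routes on a path is the product of the link counts along it.

```python
# Link counts between connected node pairs (from node -> to node),
# taken from the node pairings listed in Table 6.
LINKS = {
    (1, 2): 2, (2, 3): 2, (2, 4): 2, (2, 5): 4,
    (3, 6): 3, (4, 6): 4, (4, 7): 2, (5, 7): 3, (5, 8): 2,
    (6, 9): 4, (7, 10): 5, (7, 11): 4,
    (8, 12): 4, (8, 13): 2, (8, 14): 2, (8, 15): 2,
    (9, 16): 2, (10, 16): 3, (11, 16): 3, (12, 16): 2,
    (13, 16): 4, (14, 16): 4, (15, 16): 4,
}

def successors(node):
    """Nodes reachable from `node` by a single branch."""
    return sorted(b for (a, b) in LINKS if a == node)

def enumerate_paths(start, end):
    """Depth-first enumeration of all node sequences from start to end."""
    if start == end:
        return [[end]]
    paths = []
    for nxt in successors(start):
        for tail in enumerate_paths(nxt, end):
            paths.append([start] + tail)
    return paths

def route_count(path):
    """Number of routes on a path: product of link counts between successive nodes."""
    total = 1
    for a, b in zip(path, path[1:]):
        total *= LINKS[(a, b)]
    return total

paths = enumerate_paths(1, 16)
routes = {tuple(p): route_count(p) for p in paths}
total_routes = sum(routes.values())
```

With these link counts the walk finds 10 paths and 1,600 routes in total, in agreement with the totals reported for Table 4.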


On sample allocation for efficient domain estimation


G. Hussain Choudhry, J.N.K. Rao and Michael A. Hidiroglou 1


Abstract 


Sample allocation issues are studied in the context of estimating sub-population (stratum or domain) means as well as the 
aggregate population mean under stratified simple random sampling. A non-linear programming method is used to obtain 
“optimal” sample allocation to strata that minimizes the total sample size subject to specified tolerances on the coefficient of 
variation of the estimators of strata means and the population mean. The resulting total sample size is then used to determine 
sample allocations for the methods of Costa, Satorra and Ventura (2004) based on compromise allocation and Longford 
(2006) based on specified “inferential priorities”. In addition, we study sample allocation to strata when reliability 
requirements for domains, cutting across strata, are also specified. Performance of the three methods is studied using data 
from Statistics Canada’s Monthly Retail Trade Survey (MRTS) of single establishments. 


Key Words: Composite estimators; Compromise allocation; Direct estimators; Domain means; Non-linear
programming.


1. Introduction


Stratified simple random sampling is widely used in business surveys and other
establishment surveys employing list frames. The population mean
$\bar{Y} = \sum_h W_h \bar{Y}_h$ is estimated by the weighted sample mean
$\bar{y}_{st} = \sum_h W_h \bar{y}_h$, where $W_h = N_h/N$ is the relative size of
stratum $h$ $(= 1, \ldots, L)$ and $\bar{Y}_h$ and $\bar{y}_h$ are the stratum
population mean and sample mean respectively. The well-known Neyman sample
allocation to strata is optimal for estimating the population mean in the sense of
minimizing the variance of $\bar{y}_{st}$ subject to $\sum_h n_h = n$ where $n$ is
fixed, or minimizing $\sum_h n_h$ subject to fixed variance of $\bar{y}_{st}$, where
$n_h$ is the stratum sample size. But the Neyman allocation may cause some strata to
have large coefficients of variation (CV) of the means $\bar{y}_h$. On the other
hand, equal sample allocation, $n_h = n/L$, is efficient for estimating strata means,
but it may lead to a much larger CV of the estimator $\bar{y}_{st}$ compared to that
of Neyman allocation.

Bankier (1988) proposed a “power allocation” as a compromise between Neyman
allocation and equal allocation. Letting $C_h = S_h/\bar{Y}_h$ be the stratum CV, the
power allocation is

$$n_h = n \, \frac{C_h X_h^q}{\sum_h C_h X_h^q}, \qquad (1.1)$$

where $X_h$ is some measure of size or importance of stratum $h$ and $q$ is a tuning
constant. Power allocation (1.1) is obtained by minimizing
$\sum_h \{X_h^q \, \mathrm{CV}(\bar{y}_h)\}^2$ subject to $\sum_h n_h = n$, where
$\mathrm{CV}(\bar{y}_h)$ is the CV of the stratum sample mean $\bar{y}_h$. The choice
$q = 1$ and $X_h = N_h \bar{Y}_h$ in (1.1) leads to Neyman allocation

$$n_h^N = n \, \frac{N_h S_h}{\sum_h N_h S_h}, \qquad (1.2)$$

and $q = 0$ gives equal allocation if $C_h = C$ for all $h$, where $S_h^2$ is the
stratum variance. Bankier (1988) viewed values of $q$ between 0 and 1 as providing
compromise allocations. He gave a numerical example to illustrate how $q$ may be
chosen in practice. The choice $X_h = N_h$ and $q = 1/2$ in (1.1) gives “square root
allocation” $n_h = n \sqrt{N_h} / \sum_h \sqrt{N_h}$ if $C_h = C$. Power allocation
(1.1) and some other allocations generally depend on the variable of interest $y$ and
hence in practice a proxy variable with known population values is used in place of
$y$.
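
As a rough illustration of (1.1), the following Python sketch computes a power allocation and checks the two special cases mentioned above. The three-stratum population is hypothetical, and the allocations are left unrounded.

```python
import math

def power_allocation(n, C, X, q):
    """Power allocation (1.1): n_h = n * C_h * X_h**q / sum_h(C_h * X_h**q)."""
    scores = [c * x ** q for c, x in zip(C, X)]
    total = sum(scores)
    return [n * s / total for s in scores]

# Hypothetical three-stratum population (not from the MRTS data).
N = [1000, 4000, 250]           # stratum sizes N_h
Ybar = [50.0, 80.0, 20.0]       # stratum means
S = [100.0, 240.0, 30.0]        # stratum standard deviations S_h
C = [s / y for s, y in zip(S, Ybar)]
n = 300

# q = 1 with X_h = N_h * Ybar_h reproduces Neyman allocation (1.2).
neyman_via_power = power_allocation(n, C, [Nh * y for Nh, y in zip(N, Ybar)], q=1)
sum_NS = sum(Nh * s for Nh, s in zip(N, S))
neyman_direct = [n * Nh * s / sum_NS for Nh, s in zip(N, S)]

# With equal stratum CVs, q = 0 gives equal allocation and
# q = 1/2 with X_h = N_h gives square root allocation.
equal_C = [2.0, 2.0, 2.0]
equal_alloc = power_allocation(n, equal_C, N, q=0)
sqrt_alloc = power_allocation(n, equal_C, N, q=0.5)
```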

Costa et al. (2004) proposed a compromise allocation based on a convex combination of
proportional allocation, $n_h = nW_h$, and equal allocation, $n_h = n/L$; see
Section 2.1. Longford (2006) made a systematic study of allocation in stratified
simple random sampling by introducing “inferential priorities” $P_h$ for the strata
$h$ and $G$ for the population. In particular, he assumed that $P_h = N_h^q$ for a
specified $q$ $(0 \le q \le 2)$; see Section 2.2. He also studied the case of small
strata sample sizes $n_h$, in which case composite estimators of strata means
$\bar{Y}_h$ may be used.

The main purpose of our paper is to propose an “optimal” allocation method, based on
non-linear programming (NLP); see Section 2.3. It minimizes the total sample size
$\sum_h n_h$ subject to specified tolerances on the CVs of the strata sample means
$\bar{y}_h$ and the estimated population mean $\bar{y}_{st}$. The case of indirect
(composite) estimators of strata means is studied in Section 3. In Section 4, we study
optimal sample allocation to strata when reliability requirements for domains, cutting
across strata, are also specified.


1. G.H. Choudhry, Statistical Research and Innovation Division, Statistics Canada. E-mail: ghchoudhry@gmail.com; J.N.K. Rao, School of Mathematics
and Statistics, Carleton University. E-mail: jrao@math.carleton.ca; M.A. Hidiroglou, Statistical Research and Innovation Division, Statistics Canada.
E-mail: mike.hidiroglou@statcan.gc.ca.


The proposed method readily extends to multiple variables, but for simplicity we omit
details. Using the optimal total sample size obtained from NLP, we make a numerical
study of the performances of the Costa et al. and Longford methods in terms of
satisfying reliability requirements (Section 5).


2. Allocation for direct estimators 


In this section, we consider direct estimators, $\bar{y}_h$, of strata population
means, assuming stratified simple random sampling. The case of indirect estimators of
strata means is studied in Section 3. Indirect strata estimators are used in the case
of strata with small sample sizes $n_h$.


2.1 Costa et al. allocation


The sample allocation of Costa et al. (2004) is

$$n_h^C = k(nW_h) + (1 - k)(n/L) \qquad (2.1)$$

for a specified constant $k$ $(0 \le k \le 1)$. This allocation reduces to equal
allocation when $k = 0$ and to proportional allocation when $k = 1$. Formula (2.1)
needs to be modified when $n/L > N_h$ for some $h$ in a set of strata $A$. The
modified allocation is

$$\tilde{n}_h^C = k(nW_h) + (1 - k)\, n_h^0, \qquad (2.2)$$

where $n_h^0 = N_h$ if $h \in A$ and $n_h^0 = (n - \sum_{h \in A} N_h)/(L - m)$
otherwise, where $m$ is the number of strata in the set $A$. Note that when $k = 0$,
(2.2) gives modified equal allocation. We study different choices of the constant $k$
in the numerical study of Section 5, based on data from
Statistics Canada’s Monthly Retail Trade Survey (MRTS).
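
The modified allocation (2.2) can be sketched in Python as follows. The four stratum sizes are hypothetical, the function name is illustrative, and the result is left unrounded; the set A is formed once from the condition n/L > N_h, as in the text.

```python
def costa_allocation(n, N, k):
    """Compromise allocation (2.2); reduces to (2.1) when no stratum has N_h < n/L."""
    L = len(N)
    N_total = sum(N)
    # Set A: strata for which equal allocation n/L would exceed the stratum size.
    A = [h for h in range(L) if n / L > N[h]]
    m = len(A)
    # Modified equal allocation: take-all in A, remainder shared equally elsewhere.
    remainder = (n - sum(N[h] for h in A)) / (L - m)
    n0 = [N[h] if h in A else remainder for h in range(L)]
    return [k * n * N[h] / N_total + (1 - k) * n0[h] for h in range(L)]

# Hypothetical stratum sizes; the first stratum is smaller than n/L = 200.
N = [100, 2000, 3000, 4900]
n = 800
modified_equal = costa_allocation(n, N, k=0.0)
proportional = costa_allocation(n, N, k=1.0)
compromise = costa_allocation(n, N, k=0.5)
```

For the MRTS setting of Section 5, with n = 3,446, L = 10 and PE (N_h = 280) the only stratum in A, the same rule gives (3,446 − 280)/9 ≈ 352 for each of the other nine provinces.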


2.2 Longford allocation 


Longford’s (2006) method attempts to simultaneously control the reliability of the
strata means $\bar{y}_h$ and the estimated population mean $\bar{y}_{st}$ by
minimizing the objective function

$$\sum_{h=1}^{L} P_h V(\bar{y}_h) + (G P_+) V(\bar{y}_{st}) \qquad (2.3)$$

with respect to the strata sample sizes $n_h$ subject to $\sum_h n_h = n$, where
$P_+ = \sum_h P_h$. The first component in (2.3) specifies relative importance,
$P_h$, of each stratum $h$ while the second component attaches relative importance to
$\bar{y}_{st}$ through the weight $G$. Longford (2006) assumed that $P_h = N_h^q$ for
some constant $q$ $(0 \le q \le 2)$. The term $P_+$ in (2.3) offsets the effect of the
sizes $P_h$ and the number of strata on the weight $G$.

Under stratified simple random sampling, the sample allocation minimizing (2.3) is

$$n_h^L = n \, \frac{S_h \sqrt{P_h'}}{\sum_{h=1}^{L} S_h \sqrt{P_h'}}, \qquad (2.4)$$

where $P_h' = P_h + G P_+ W_h^2$. If $q = 2$, then (2.4) does not depend on the value
of $G$ and it reduces to Neyman allocation, $n_h^N$, given by (1.2).
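
A minimal Python sketch of allocation (2.4) follows, assuming the form P'_h = P_h + G P_+ W_h^2 obtained by minimizing (2.3); the population values are hypothetical. It also checks numerically that q = 2 gives Neyman allocation whatever the value of G.

```python
import math

def longford_allocation(n, N, S, q, G):
    """Allocation (2.4): n_h proportional to S_h * sqrt(P_h + G * P_plus * W_h**2)."""
    N_total = sum(N)
    W = [Nh / N_total for Nh in N]
    P = [Nh ** q for Nh in N]                 # inferential priorities P_h = N_h**q
    P_plus = sum(P)
    P_prime = [p + G * P_plus * w ** 2 for p, w in zip(P, W)]
    scores = [s * math.sqrt(pp) for s, pp in zip(S, P_prime)]
    total = sum(scores)
    return [n * sc / total for sc in scores]

# Hypothetical three-stratum population.
N = [1000, 4000, 250]
S = [100.0, 240.0, 30.0]
n = 300

sum_NS = sum(Nh * s for Nh, s in zip(N, S))
neyman = [n * Nh * s / sum_NS for Nh, s in zip(N, S)]
alloc_q2_G0 = longford_allocation(n, N, S, q=2, G=0)
alloc_q2_G100 = longford_allocation(n, N, S, q=2, G=100)
alloc_q0_G0 = longford_allocation(n, N, S, q=0, G=0)   # proportional to S_h only
```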


2.3. Nonlinear programming (NLP) allocation 


We now turn to the NLP method of determining the strata sample sizes $n_h$ subject to
specified reliability requirements on both the strata sample means and the estimated
population mean. Letting $f = (f_1, \ldots, f_L)'$ with $f_h = n_h/N_h$, we minimize
the total sample size

$$g(f) = \sum_{h=1}^{L} f_h N_h \qquad (2.5)$$

with respect to $f$ subject to

$$\mathrm{CV}(\bar{y}_h) \le \mathrm{CV}_{0h}, \quad h = 1, \ldots, L \qquad (2.6)$$

$$\mathrm{CV}(\bar{y}_{st}) \le \mathrm{CV}_0 \qquad (2.7)$$

$$0 < f_h \le 1, \quad h = 1, \ldots, L \qquad (2.8)$$

where $\mathrm{CV}_{0h}$ and $\mathrm{CV}_0$ are specified tolerances on the CV of the
stratum sample mean $\bar{y}_h$ and the estimated population mean $\bar{y}_{st}$,
respectively. Inequality signs are used in (2.6) and (2.7) because the resulting CVs
for some strata $h$ and/or for the aggregate may be smaller than the specified
tolerances (Cochran 1977, page 122).

Letting $k_h = f_h^{-1}$, (2.5) becomes a separable convex function of the variables
$k_h$:

$$\tilde{g}(k) = \sum_{h=1}^{L} N_h k_h^{-1}. \qquad (2.9)$$

We re-specify the constraints (2.6) and (2.7) in terms of relative variances so that
the constraints are linear in the variables $k_h$. The relative variance (RV) of
$\bar{y}_h$ is the square of its CV,

$$\mathrm{RV}(\bar{y}_h) = \frac{k_h - 1}{N_h}\, C_h^2. \qquad (2.10)$$

Similarly, the relative variance of $\bar{y}_{st}$ is the square of its CV,

$$\mathrm{RV}(\bar{y}_{st}) = \bar{Y}^{-2} \sum_{h=1}^{L} W_h^2 \, \frac{k_h - 1}{N_h}\, S_h^2. \qquad (2.11)$$

We used the SAS procedure NLP with the Newton-Raphson option to find the optimal
$k_h$ that would minimize (2.9) subject to

$$\mathrm{RV}(\bar{y}_h) \le \mathrm{RV}_{0h}, \quad h = 1, \ldots, L \qquad (2.12)$$

$$\mathrm{RV}(\bar{y}_{st}) \le \mathrm{RV}_0 \qquad (2.13)$$

$$k_h \ge 1, \quad h = 1, \ldots, L. \qquad (2.14)$$

$\mathrm{RV}(\bar{y}_h)$ and $\mathrm{RV}(\bar{y}_{st})$ are given by (2.10) and
(2.11), where $\mathrm{RV}_{0h} = \mathrm{CV}_{0h}^2$ and
$\mathrm{RV}_0 = \mathrm{CV}_0^2$. By expressing the constraints as linear constraints
and the objective function as a separable convex function, we achieve faster
convergence of the re-formulated NLP. Denoting the solution to NLP as
$k^o = (k_1^o, \ldots, k_L^o)'$, the corresponding vector of optimal strata sample
sizes is given by $n^o = (n_1^o, \ldots, n_L^o)'$, where $n_h^o = N_h/k_h^o$. We can
modify (2.14) to ensure that $n_h^o \ge 2$ for all $h$, which permits unbiased
variance estimation.
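
The re-formulated problem (2.9) with constraints (2.12) to (2.14) can be illustrated with a small pure-Python sketch. It does not use SAS PROC NLP or Newton-Raphson: because only one constraint, (2.13), couples the strata, the sketch instead bisects on a single Lagrange multiplier and clips each k_h to its box. The population values, tolerances and function names are hypothetical, and the returned sample sizes are real-valued (not rounded).

```python
import math

def nlp_allocation(N, Ybar, S, cv0_strata, cv0_pop):
    """Real-valued n_h minimising sum_h N_h / k_h subject to (2.12)-(2.14)."""
    L = len(N)
    N_total = sum(N)
    Ybar_pop = sum(N[h] * Ybar[h] for h in range(L)) / N_total
    # (2.12) is linear in k_h: (k_h - 1) * C_h**2 / N_h <= CV0h**2.
    upper = [1.0 + cv0_strata ** 2 * N[h] / (S[h] / Ybar[h]) ** 2 for h in range(L)]
    # (2.13): sum_h a_h * (k_h - 1) <= CV0**2, a_h = W_h**2 * S_h**2 / (N_h * Ybar**2).
    a = [(N[h] / N_total) ** 2 * S[h] ** 2 / (N[h] * Ybar_pop ** 2) for h in range(L)]
    rv0 = cv0_pop ** 2

    def k_of(lam):
        # Stationarity of the Lagrangian, -N_h / k_h**2 + lam * a_h = 0,
        # clipped to the box 1 <= k_h <= upper_h.
        return [min(upper[h], max(1.0, math.sqrt(N[h] / (lam * a[h])))) for h in range(L)]

    def rv_pop(k):
        return sum(a[h] * (k[h] - 1.0) for h in range(L))

    k = list(upper)                      # solution if only strata constraints matter
    if rv_pop(k) > rv0:                  # population constraint (2.13) is binding
        lo, hi = 0.0, 1.0
        while rv_pop(k_of(hi)) > rv0:    # bracket the multiplier
            hi *= 2.0
        for _ in range(200):             # bisection on the multiplier
            mid = 0.5 * (lo + hi)
            if rv_pop(k_of(mid)) > rv0:
                lo = mid
            else:
                hi = mid
        k = k_of(hi)
    return [N[h] / k[h] for h in range(L)]

def cv_stratum(n_h, N_h, C_h):
    """CV of a stratum sample mean under simple random sampling."""
    return C_h * math.sqrt(1.0 / n_h - 1.0 / N_h)

def cv_population(n, N, Ybar, S):
    """CV of the weighted sample mean under stratified simple random sampling."""
    N_total = sum(N)
    Ybar_pop = sum(Nh * y for Nh, y in zip(N, Ybar)) / N_total
    var = sum((N[h] / N_total) ** 2 * S[h] ** 2 * (1.0 / n[h] - 1.0 / N[h])
              for h in range(len(N)))
    return math.sqrt(var) / Ybar_pop

# Hypothetical population and tolerances (15% for strata, 6% overall).
N = [1000, 4000, 250]
Ybar = [50.0, 80.0, 20.0]
S = [100.0, 240.0, 30.0]
n_opt = nlp_allocation(N, Ybar, S, cv0_strata=0.15, cv0_pop=0.06)
strata_cvs = [cv_stratum(n_opt[h], N[h], S[h] / Ybar[h]) for h in range(3)]
pop_cv = cv_population(n_opt, N, Ybar, S)
```

In this hypothetical example the overall tolerance is binding, so the largest stratum receives more sample than its own 15% tolerance would require, mirroring the behaviour reported for QC and ON in Section 5.1.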

The NLP method can be readily extended to multiple variables $y_1, \ldots, y_P$ by
specifying tolerances on the CVs of strata means and the estimated population mean for
each variable $y_p$ $(p = 1, \ldots, P)$. If the number of variables $P$ is not small,
then the resulting optimal total sample size $n^o = \sum_h n_h^o$ may increase
significantly relative to $n^o$ for a single variable. Huddleston, Claypool and
Hocking (1970), Bethel (1989) and others studied NLP for optimal sample allocation in
the case of estimating population means of multiple variables under stratified random
sampling.


3. Allocation for composite estimators 


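Longford (2006) studied composite estimators of strata means of the form

$$\hat{\theta}_h = \alpha_h \bar{y}_h^S + (1 - \alpha_h)\, \bar{y}_h, \qquad (3.1)$$

where $\bar{y}_h^S$ is a synthetic estimator; here we take $\bar{y}_h^S = \bar{y}_{st}$.
The MSE of $\hat{\theta}_h$ is

$$\mathrm{MSE}(\hat{\theta}_h) = \alpha_h^2 \sum_{h'=1}^{L} W_{h'}^2 \frac{S_{h'}^2}{n_{h'}} + (1 - \alpha_h)^2 \frac{S_h^2}{n_h} + 2\alpha_h (1 - \alpha_h) W_h \frac{S_h^2}{n_h} + \alpha_h^2 (\bar{Y}_h - \bar{Y})^2 + \text{terms not depending on the } n_h. \qquad (3.2)$$

Longford (2006) showed that the optimal coefficient $\alpha_h$ in (3.1) minimizing
(3.2) is approximately equal to $\alpha_h^* = S_h^2 (S_h^2 + n_h \Delta_h^2)^{-1}$,
where $\Delta_h = \bar{Y}_h - \bar{Y}$. He then replaced $\Delta_h^2$ in $\alpha_h^*$
by its average over the strata, denoted by
$\sigma_B^2 = L^{-1} \sum_h (\bar{Y}_h - \bar{Y})^2$, leading to
$\alpha_h^* \approx (1 + n_h \omega_h)^{-1}$, where $\omega_h = \sigma_B^2 / S_h^2$.
The resulting MSE of $\hat{\theta}_h$ is approximated as

$$\mathrm{MSE}(\hat{\theta}_h) \approx \frac{\sigma_B^2}{1 + n_h \omega_h}. \qquad (3.3)$$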


Longford’s allocation is obtained by minimizing the objective function

$$\sum_{h=1}^{L} P_h \, \mathrm{MSE}(\hat{\theta}_h) + (G P_+) V(\bar{y}_{st}) \qquad (3.4)$$

with respect to the $n_h$. The resulting solution satisfies

$$\frac{P_h \sigma_B^2 \omega_h}{(1 + n_h \omega_h)^2} + (G P_+) W_h^2 \frac{S_h^2}{n_h^2} = \text{const.}, \quad h = 1, \ldots, L. \qquad (3.5)$$

Longford used an iterative method to obtain the solution to (3.5) since it does not
have a closed-form solution.

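Our NLP procedure minimizes $g(f)$ given by (2.5) subject to

$$\mathrm{RMSE}(\hat{\theta}_h) \le \mathrm{RMSE}_{0h}, \; h = 1, \ldots, L; \qquad \mathrm{RV}(\bar{y}_{st}) \le \mathrm{RV}_0 \qquad (3.6)$$

and (2.8), where $\mathrm{RMSE}(\hat{\theta}_h) = \mathrm{MSE}(\hat{\theta}_h)/\bar{Y}_h^2$
and $\mathrm{RMSE}_{0h}$ is a specified tolerance. The approximation (3.3) to
$\mathrm{MSE}(\hat{\theta}_h)$ is used in (3.6).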


4. Allocation for domain estimation 


Suppose that the population $U$ is partitioned into domains ${}_dU$
$(d = 1, \ldots, D)$ that cut across the strata. Also, suppose that the estimators of
domain means need to satisfy specified relative variance tolerances,
${}_d\mathrm{RV}_0$, $d = 1, \ldots, D$. We find the optimal additional strata sample
sizes that are needed to satisfy the domain tolerances, using the NLP method.

An estimator of the domain mean
${}_d\bar{Y} = {}_dN^{-1} \sum_h \sum_{k \in U_h} {}_d\delta_k \, y_k$ is the ratio
estimator

$${}_d\hat{\bar{Y}} = \frac{\sum_{h=1}^{L} (N_h/n_h) \sum_{k \in s_h} {}_d\delta_k \, y_k}{\sum_{h=1}^{L} (N_h/n_h) \sum_{k \in s_h} {}_d\delta_k}, \qquad (4.1)$$

where ${}_d\delta_k = 1$ if $k \in {}_dU$ and ${}_d\delta_k = 0$ otherwise, $s_h$ is
the sample from stratum $h$ and ${}_dN$ is the size of domain $d$. The relative
variance of the ratio estimator (4.1) is
$\mathrm{RV}({}_d\hat{\bar{Y}}) = V({}_d\hat{\bar{Y}}) / {}_d\bar{Y}^2$, where the
variance $V({}_d\hat{\bar{Y}})$ is obtained by the usual linearization formula for a
ratio estimator.
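
Let $\tilde{n}_h$ denote the revised total sample size from stratum $h$ so that the
sample increase from stratum $h$ is $\tilde{n}_h - n_h^o$. Let
$\tilde{f}_h = \tilde{n}_h / N_h$ be the corresponding sampling fraction. We obtain
the optimal $\tilde{n} = (\tilde{n}_1, \ldots, \tilde{n}_L)'$ by minimizing the sample
increase

$$\tilde{g}(\tilde{f}) = \sum_{h=1}^{L} N_h \tilde{f}_h - \sum_{h=1}^{L} n_h^o \qquad (4.2)$$

with respect to $\tilde{f} = (\tilde{f}_1, \ldots, \tilde{f}_L)'$ subject to

$$n_h^o / N_h \le \tilde{f}_h \le 1, \quad h = 1, \ldots, L \qquad (4.3)$$

$$\mathrm{RV}({}_d\hat{\bar{Y}}) \le {}_d\mathrm{RV}_0, \quad d = 1, \ldots, D. \qquad (4.4)$$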


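As before, we reformulate the problem by expressing (4.2), (4.3) and (4.4) in terms of
$\tilde{k} = (\tilde{k}_1, \ldots, \tilde{k}_L)'$, where
$\tilde{k}_h = \tilde{f}_h^{-1}$. This leads to minimization of the separable convex
function

$$\tilde{g}(\tilde{k}) = \sum_{h=1}^{L} (N_h \tilde{k}_h^{-1} - n_h^o) \qquad (4.5)$$

with respect to $\tilde{k}$ subject to the linear constraints

$$1 \le \tilde{k}_h \le k_h^o, \quad h = 1, \ldots, L \qquad (4.6)$$

and

$$\mathrm{RV}({}_d\hat{\bar{Y}}) = ({}_dN \, {}_d\bar{Y})^{-2} \sum_{h=1}^{L} N_h (\tilde{k}_h - 1)\, {}_dS_{eh}^2 \le {}_d\mathrm{RV}_0, \quad d = 1, \ldots, D, \qquad (4.7)$$

where ${}_d\mathrm{RV}_0$ is the specified tolerance, ${}_dS_{eh}^2$ denotes the
stratum variance of the residuals ${}_de_k = {}_d\delta_k (y_k - {}_d\bar{Y})$ for
$k \in U_h$, and $U_h$ denotes the stratum population. Denote the resulting optimal
$\tilde{k}_h$ and $\tilde{n}_h$ as $\tilde{k}_h^o$ and $\tilde{n}_h^o$ respectively,
so that the optimal sample increase in stratum $h$ is $\tilde{n}_h^o - n_h^o$.
It can be shown that the minimization of total sample size subject to all the
constraints $\mathrm{RV}(\bar{y}_h) \le \mathrm{RV}_{0h}$, $h = 1, \ldots, L$,
$\mathrm{RV}({}_d\hat{\bar{Y}}) \le {}_d\mathrm{RV}_0$, $d = 1, \ldots, D$,
$\mathrm{RV}(\bar{y}_{st}) \le \mathrm{RV}_0$, and $0 < f_h \le 1$,
$h = 1, \ldots, L$, will lead to the same optimal solution,
$\tilde{n}^o = (\tilde{n}_1^o, \ldots, \tilde{n}_L^o)'$. However, domain reliability
requirements may often be specified after determining $n^o$.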


5. Empirical results 


In this section, we study the relative performance of 
different sample allocation methods, using data from the 
MRTS. Sections 5.1 and 5.2 report our results for direct
estimators and composite estimators of strata means, 
respectively. Results for the domain means are given in 
section 5.3. 


5.1 Strata means: Direct estimators 


For the empirical study, we used a subset of the MRTS population values restricted to
single establishments. Strata sizes, $N_h$, strata population means, $\bar{Y}_h$,
strata standard deviations, $S_h$, and strata CVs, $C_h = S_h/\bar{Y}_h$, are given in
Table 1 for the ten provinces in Canada (treated as strata). For the NLP allocation,
we have taken the CV tolerances as $\mathrm{CV}_{0h}$ = 15% for the strata means
$\bar{y}_h$ and $\mathrm{CV}_0$ = 6% for the weighted sample mean $\bar{y}_{st}$,
denoted Canada (CA).

The NLP allocation satisfying the specified CV tolerances resulted in a minimum
overall sample size $n^o$ = 3,446. Table 2 reports the sample allocation $n_h^o$ and
the associated $\mathrm{CV}(\bar{y}_h)$ and $\mathrm{CV}(\bar{y}_{st})$ for the NLP
allocation. It shows that the NLP allocation respects the specified tolerance
$\mathrm{CV}_0$ = 6%, gives CVs smaller than the specified tolerance
$\mathrm{CV}_{0h}$ = 15% for two of the larger provinces (QC: 11.4% and ON: 11.0%) and
attains a 15% CV for the remaining provinces.


Table 1 
Population values for the MRTS 


Provinces N, Y, S;, C, 


Newfoundland (NL ) 909 963 + 1,943 2.02 
Price-Edward-Island (PE) 280 Dh ABS» 28 
New-Brunswick (NB) 1,333? “1B68" 3,200 9334 
Nova-Scotia (NS) 1,153 7 1,568 4302882374 
Quebec (QC) 14135, 52:006< 4:)29% 2:56 
Ontario (ON) 2193 bea, 1,722. 6,207.60 
Manitoba (MN ) L700. Le29S: 2 Oo ou 
Saskatchewan (SK) 1743 F212" 3,082 42-49 
Alberta (AL) 5292" 1,698" "5.358 Sh 
British Columbia (BC) W803 / £1,297 >A 39S 
Canada (CA) 52,879 1,654 - - 
Table 2 


Equal, proportional, square root and NLP allocations and 
associated CVs (%) 


Province Equal Proportional Square-Root NLP 
Ny, CV, Ny, CV, Ny, CYe ny, CV, 
NL 352 “SA 59 “25,4 169= 140 at ee ete 
1s) 280 =0.0 18 44.1 94 16.2 104 15.0 
NB 352 1057. 87-) 242°"005 = TSO 2 200-ts0 
NS 352”) 1D DAS S06 POPES Te et 2590 TOU 
Oe 352 124 726 85 593 8940 MAI0ite 
ON 352° 193414034294 5 824 12:5 21,0560 Fee 
MN 352>..10.9° Ti geetth 2327 “14a 20Cr rio 
SK 352 19 114. 23:6 92340 15:25 238ai oe 
AL 352. "163" 73450) 316-44 P408e~15.0) wA0orn5o 
BC 352u162; S08 Guls.3c49Germ 135: edOF welSi0 
CA 3,446 9.1 3,446 5.2 3,446 63 3,446 6.0 


Using the optimal overall sample size 3,446, we calculated the sample allocations
$n_h$ and the associated $\mathrm{CV}(\bar{y}_h)$ and $\mathrm{CV}(\bar{y}_{st})$ for
the modified equal allocation, proportional allocation and square-root allocation,
reported in Table 2. It is clear from Table 2 that the modified equal allocation is
not suitable in terms of satisfying specified CV tolerances because it leads to
$\mathrm{CV}(\bar{y}_{st})$ = 9.1%, which is significantly larger than the specified
$\mathrm{CV}_0$ = 6%. Also, under the modified equal allocation,
$\mathrm{CV}(\bar{y}_h)$ equals 19.3%, 16.3% and 16.2% for the larger provinces ON, AL
and BC respectively. Note that for the smallest province PE, Table 2 gives
$\mathrm{CV}(\bar{y}_h)$ = 0% for the modified equal allocation because for PE it
gives $n_h = N_h$.

Turning to proportional allocation, Table 2 reports $\mathrm{CV}(\bar{y}_{st})$ = 5.2%
but it leads to considerably larger strata CVs relative to the specified 15% for seven
of the provinces, ranging from 16.4% to 44.1%. On the other hand, Table 2 shows that
square-root allocation offers a reasonable compromise in terms of desired CV
tolerances. We have $\mathrm{CV}(\bar{y}_{st})$ = 6.3% and
$\mathrm{CV}(\bar{y}_h) \le$ 15% for seven of the provinces, and the three provinces
with CVs greater than 15% are SK with 15.2%, PE with 16.2% and NS with 18.1%.
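
The strata CVs discussed above follow from the stratum size, the stratum CV and the allocated sample size. As a small check, the Python sketch below evaluates the standard simple random sampling formula CV = C_h sqrt(1/n_h − 1/N_h) for Newfoundland, using N_h = 909 and C_h = 2.02 from Table 1 and the sample sizes 352 (modified equal) and 59 (proportional) from Table 2; the function name is illustrative.

```python
import math

def cv_stratum_mean(n_h, N_h, C_h):
    """CV of a stratum sample mean under simple random sampling without replacement."""
    return C_h * math.sqrt(1.0 / n_h - 1.0 / N_h)

N_NL, C_NL = 909, 2.02                          # Newfoundland values from Table 1
cv_equal = cv_stratum_mean(352, N_NL, C_NL)     # modified equal allocation
cv_prop = cv_stratum_mean(59, N_NL, C_NL)       # proportional allocation
```

Here `cv_prop` is about 0.254, in line with the 25.4% reported for NL under proportional allocation, and `cv_equal` is about 0.084.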

Table 3 reports the results for the Costa et al. allocation (2.1) with $k$ = 0.25,
0.50 and 0.75, using $n$ = 3,446 obtained from NLP. We observe from Table 3 that the
choice $k$ = 0.25, which assigns more weight to equal allocation, is not satisfactory
for the estimation of the population (Canada) mean:
$\mathrm{CV}(\bar{y}_{st})$ = 7.2%, but performs well for strata means, except AL with
$\mathrm{CV}(\bar{y}_h)$ = 16.3%. On the other hand, the choice $k$ = 0.75, which
assigns more weight to proportional allocation, performs poorly in estimating
provincial means, with $\mathrm{CV}(\bar{y}_h)$ ranging from 16.2% to 21.4% for seven
of the provinces, although $\mathrm{CV}(\bar{y}_{st})$ is smaller than the desired
tolerance, 6%. The compromise choice $k$ = 0.50 performs quite well, leading to
$\mathrm{CV}(\bar{y}_{st})$ = 6.4% and $\mathrm{CV}(\bar{y}_h)$ around 15% or less
except for two provinces (NS and AL) with CVs of 17.0% and 16.5% respectively.
Performance of the Costa et al. method with $k$ = 0.50 and square-root allocation are
somewhat similar, and both allocations do not depend on the variable of interest, $y$,
unlike the Longford and NLP allocations.

We next turn to Longford’s allocation (2.4) which depends on $q$ and $G$. Table 4
provides results for $q$ = 0, 0.5, 1.0, 1.5 and $G$ = 0, 10, 100, using $n$ = 3,446
obtained from NLP. For $q$ = 2.0, Longford’s allocation does not depend on $G$ and in
fact it reduces to the Neyman allocation (1.2) which minimizes
$\mathrm{CV}(\bar{y}_{st})$ for fixed $n$ but leads to highly inflated
$\mathrm{CV}(\bar{y}_h)$, ranging from 16% to 85% for seven provinces. We see from
this table that $\mathrm{CV}(\bar{y}_h)$, for a given $q$, increases with $G$ rapidly
while $\mathrm{CV}(\bar{y}_{st})$ decreases slowly as $G$ increases and in fact is
virtually a constant ($\approx$ 5.1%) for $G$ > 100 (values not reported here). Also,
$\mathrm{CV}(\bar{y}_h)$, for a given $G$, increases rapidly as $q$ increases while
$\mathrm{CV}(\bar{y}_{st})$ decreases. Longford’s allocation, for $q \ge 0.5$ and/or
$G \ge 10$, leads to significantly larger $\mathrm{CV}(\bar{y}_h)$ than the specified
tolerance $\mathrm{CV}_{0h}$ = 15% for several provinces, even though
$\mathrm{CV}(\bar{y}_{st})$ respects the specified tolerance of 6%. On the other hand,
for $q$ = 0 and $G$ = 0, $\mathrm{CV}(\bar{y}_h)$ is below the specified tolerance
except for BC with 15.7%, but $\mathrm{CV}(\bar{y}_{st})$ = 7.3% significantly exceeds
the specified tolerance. For $q$ = 1.0 and $q$ = 1.5, $\mathrm{CV}(\bar{y}_{st})$
stays below 6% when $G$ = 0, but $\mathrm{CV}(\bar{y}_h)$ exceeds 15% for six
provinces, ranging from 17.7% to 34.0% for $q$ = 1.0 and 22.0% to 54.6% for
$q$ = 1.5. On the whole, Table 4 suggests that no suitable combination of $q$ and $G$
can be found that ensures that all the specified reliability requirements are
satisfied even approximately.


Table 3 
Costa et al.’s allocation and associated CVs (%) for k = 0.25, 
0.50 and 0.75 


CVs (“%) for Longford’s allocation with q = 0, 0.5, 1.0, and 1.5 


Table 4 
Province q=0 q=0.5

G=0 G=10 G=100 G=0 G=10 G=100 
NL 13.5 teed 9:3 29h LT Drees 28.0) 33.4 
PE 12.7 20.4 34.6 2A. 29.6 48.5 
NB PTI P| 25.0 145 19.4 26.8 
NS PCL Leet ies rae 1S 219 
QC 11.0 9.8 2H See 9.4 9.0 
ON 14.9 9.8 S27. 123 OS 8.6 
MN Late LI.6 24.3 14.7 joedon 25 2 
SK T3638 L820 2a9 LD dale OED 26.9 
AL 13. Doon. JE Seah 16.1 13 Boe sho Loy 
BC 1D LO. 15.4 14.7 15.4 1533 
CA FS a5 Sigil ap see) oy 


Province k=0.25 k = 0.50 k=0.75 
Ny CV, Ny CV, Ny CV, 
NL 278 10.1 205 12.4 132 16.2 
PE 214 6.4 149 10.8 83 W728 
NB 286 28 219 14.5 153 17.8 
NS 282 14.2 PANS) 70 144 21.4 
QC 446 10.9 539 99 633 9.1 
ON 615 14.5 878 Vea 1,140 10.5 
MN 292 ne 231 14.0 171 16.6 
SK 292 1b 3y3) 733 eZ 174 17.9 
AL 350 16.3 349 16.3 347 16.4 
BC 391 S33 430 14.6 469 139 
CA 3,446 UE 3,446 Ow 3,446 5:0 
q=1.0 q=1.5 
G=0 G=10 G=100 G=0 G=10 G=100 
MA HOY 38.3 30.4 36.2 40.6 
34.0 45.4 67.3 54.6 67.3 85.6 
LSe5 eel 29.0 Does ee 1.0 30.3 
esse aaeE Th 30.9 24.9 29.4 32.8 
92 9.0 8.9 8.9 8.9 8.8 
10.5 9.1 8.5 93 8.7 8.5 
NSIE ODS 26.5 Z20reg2d.4 Died 
LEO BBS 28.3 Wesy HAY 29.4 
1316 Ss 15.9 14.6 55 15.9 
14.3 15.0 eye 14.5 15.0 yal 
Su ms af! pe Sal 1 


5.2 Strata means: Composite estimators


We now report some results for the composite estimators, $\hat{\theta}_h$, of strata
means. We obtained the optimal total sample size as $n$ = 3,368 using NLP and the
reliability requirements (3.6). This value is slightly smaller than the optimal
$n^o$ = 3,446 for the direct estimators. For the Longford allocation, we used
$n$ = 3,368 and calculated the sample allocation and associated CVs of the composite
estimators $\hat{\theta}_h$ and the weighted mean $\bar{y}_{st}$ for specified $q$
and $G$, constraining $n_h$ to be at least two. For the MRTS data we have used, the
first term of (3.5) is small relative to the second term. As a result, the sample
allocation is flat across $G$-values for a given $q$, which means that the CVs for the
Longford allocation did not vary significantly with $G$.

Therefore, we have reported results in Table 5 only for $G$ = 0 and $q$ = 0, 0.5, 1.0
and 1.5. We note from Table 5 that $\mathrm{CV}(\hat{\theta}_h)$ decreases with $q$
for the two largest provinces (QC and ON) because the sample shifts from the smaller
provinces to these two provinces as $q$ increases. Also,
$\mathrm{CV}(\hat{\theta}_h)$ initially decreases for AL and BC but it starts
increasing when $q$ is large because the sample starts shifting to QC and ON from
these provinces as well. Further, $\mathrm{CV}(\hat{\theta}_h)$ increases with $q$ for
all other provinces except NS, for which it starts decreasing for large $q$ because of
a larger synthetic component and negligible bias. In particular,
$\mathrm{CV}(\hat{\theta}_h)$ increases rapidly for NL and PE because of very large
bias.


Table 5 
CVs (%) for the composite estimators using Longford’s 
allocation: G= 0 and qg = 0, 0.5, 1.0 and 1.5 


Province q=0 q=0.5 q=1.0 q=1.5
NL PAS ‘NAY 24.2 S13 
PE 12.4 23.8 46.0 1122 
NB 10.4 12.8 16.1 20.4 
NS 9.4 11.9 14.5 ii ig 
QC 10.3 9.0 8.3 8.0 
ON 139 1 es. 8.2 
MN iv 13:1 16.0 20.3 
SK 12.4 14.6 17.9 Zouk, 
AL 11.4 112 M5 2 
BC 14.4 13.3 29 lesa 
CA 8.0 6.3 5.4 5.6 


On the other hand, $\mathrm{CV}(\bar{y}_{st})$ decreases initially with $q$ but starts
increasing when $q$ is large because most of the sample gets allocated to QC and ON
and very little sample is assigned to the smaller provinces. It appears from Table 5
that the Longford allocation performs reasonably well only for $q$ = 0 and $G$ = 0,
giving $\mathrm{CV}(\hat{\theta}_h)$ less than 15% for all provinces at the expense of
$\mathrm{CV}(\bar{y}_{st})$ = 8.0%.


5.3. Domain means


Establishments on the Canadian Business Register are 
classified by industry using the North American Industry 
Classification System (NAICS). NAICS is principally a 
classification system for establishments and for the compi- 
lation of production statistics. The industry associated with 
each establishment on the Canadian Business Register is 
coded to six digits using the North American Industry Clas- 
sification System. There are 67 six digit codes associated 
with the Retail sector. These six digit codes are regrouped 
into 19 trade groups (TG) for publication purposes. 

We took the trade groups as domains that cut across 
provinces (strata). The trade group with the smallest number 
of establishments is TG 110 (beer, wine and liquor stores) 
with 307 establishments and the TG with the largest number 
of establishments is TG 100 (convenience and specialty 
food stores) with 7,752 establishments. Establishments were 
coded to all the 19 trade groups in all but one province: in 
PE, establishments were coded to only 16 trade groups. 

We applied NLP based on (4.5), (4.6) and (4.7), and obtained the optimal total sample
size increase to meet specified reliability requirements on the domain estimators
${}_d\hat{\bar{Y}}$. We found that no increase in the total sample size is needed if
$\mathrm{CV}({}_d\hat{\bar{Y}})$ is required to be less than or equal to 30% for each
domain. If the tolerance is reduced to 25%, then the optimal total sample size
increase is 622 and the total sample size after the increase is 4,068. If the
tolerance is further reduced to 20%, then the optimal total sample size increase is
2,100 and the total sample size after the increase is 5,546, which is considerably
larger than the original 3,446. Note that as the total sample size is increased, CVs
of strata means $\bar{y}_h$ and the weighted sample mean $\bar{y}_{st}$ decrease.


6. Summary and concluding remarks 


We have proposed a non-linear programming (NLP) 
method of sample allocation to strata under stratified 
random sampling. It minimizes the total sample size subject 
to specified tolerances on the coefficient of variation of 
estimators of strata means and the population mean. We 
considered both direct estimators of strata means and 
composite estimators of strata means. The case of domains 
cutting across strata is also studied. Difficulties with 
alternative methods in satisfying specified reliability re- 
quirements are demonstrated using data from the Statistics 
Canada Monthly Retail Trade Survey of single establish- 
ments. We also noted that NLP can be readily extended to 
handle reliability requirements for multiple variables. Com- 
promise allocations that perform reasonably well in terms of 
reliability requirements are also noted. 


Calibration alternatives to poststratification for doubly classified data


Ted Chang 1


Abstract 


We consider alternatives to poststratification for doubly classified data in which at least one of the two-way cells is too small to 
allow the poststratification based upon this double classification. In our study data set, the expected count in the smallest cell is 
0.36. One approach is simply to collapse cells. This is likely, however, to destroy the double classification structure. Our 
alternative approaches allow one to maintain the original double classification of the data. The approaches are based upon the
calibration study by Chang and Kott (2008). We choose weight adjustments dependent upon the marginal classifications (but
not full cross classification) to minimize an objective function of the differences between the population counts of the two way
cells and their sample estimates. In the terminology of Chang and Kott (2008), if the row and column classifications have I and
J cells respectively, this results in IJ benchmark variables and I + J - 1 model variables. We study the performance of these
estimators by constructing simulation simple random samples from the 2005 Quarterly Census of Employment and Wages 
which is maintained by the Bureau of Labor Statistics. We use the double classification of state and industry group. In our 
study, the calibration approaches introduced an asymptotically trivial bias, but reduced the MSE, compared to the unbiased 
estimator, by as much as 20% for a small sample. 


Key Words: Calibration; Poststratification; Prediction model. 


1. Introduction 


Suppose we have a population $U$ which is doubly stratified by two categorical
variables whose indices are denoted $(i, j)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$,
and write $U_{ij}$ for the $(i, j)$-stratum. If a simple random sample $S$ of size $n$
is taken and if $y$ denotes the variable of interest, a natural estimator for the
total $T_y = \sum_{k \in U} y_k$ is the poststratified estimator

$$\hat{t}_{PS} = \sum_{i,j} N_{ij} \bar{y}_{ij} \qquad (1)$$

where $N_{ij}$ is the size of $U_{ij}$ and $\bar{y}_{ij}$ is the sample mean of $y$
over $S \cap U_{ij}$. This estimator is widely used as long as all the sample sizes
$n_{ij}$ of $S \cap U_{ij}$ are reasonably large.

What to do if some of the $n_{ij}$ are small, or even zero?

The standard approach would be to collapse some of the cells until all the $n_{ij}$
are big enough. However, such a collapsing might not be possible in a way that
maintains the double classification scheme: that is, the indices $j$ might depend
upon $i$.

The poststratified estimator $\hat{t}_{PS}$ is a special case of a calibration
estimator. Define for each $k \in U$ the $I \times J$ vector variable
$x_k = (x_{11k}, \ldots, x_{IJk})'$ where $x_{ijk} = 1$ if $k \in U_{ij}$ and
$x_{ijk} = 0$ otherwise. The population total $T_x$ of $x$ is
$(N_{11}, \ldots, N_{IJ})'$ and, letting $d_k = N/n$ be the sampling weight and
$\beta = (N^{-1} n \, n_{11}^{-1} N_{11}, \ldots, N^{-1} n \, n_{IJ}^{-1} N_{IJ})'$,

$$\hat{t}_{PS} = \sum_{k \in S} d_k (x_k' \beta)\, y_k$$

$$T_x = \sum_{k \in S} d_k (x_k' \beta)\, x_k.$$

These equations establish that if the benchmark variables $x$ are used, then
$\hat{t}_{PS}$ is the resulting calibrated estimator of $T_y$.
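
The identity between the poststratified estimator and the calibration-weighted sum can be checked numerically. The Python sketch below uses a hypothetical 2 x 2 classification with made-up cell sizes and sample values; each sampled unit in cell (i, j) receives the weight d_k (x_k' beta) = N_ij / n_ij. Note that the weight is undefined when a cell has no sampled units, which is exactly the difficulty discussed above.

```python
from collections import defaultdict

def poststratified_total(sample, cell_sizes):
    """Estimator (1): sum over cells of N_ij times the sample mean in the cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cell, y in sample:
        sums[cell] += y
        counts[cell] += 1
    return sum(cell_sizes[c] * sums[c] / counts[c] for c in cell_sizes)

def calibration_weights(sample, cell_sizes):
    """Weight d_k * (x_k' beta) = N_ij / n_ij for each sampled unit."""
    counts = defaultdict(int)
    for cell, _ in sample:
        counts[cell] += 1
    return [cell_sizes[cell] / counts[cell] for cell, _ in sample]

# Hypothetical population cell sizes N_ij and sample of (cell, y) pairs.
cell_sizes = {(1, 1): 40, (1, 2): 10, (2, 1): 30, (2, 2): 20}
sample = [((1, 1), 5.0), ((1, 1), 7.0), ((1, 2), 3.0),
          ((2, 1), 4.0), ((2, 1), 6.0), ((2, 1), 8.0), ((2, 2), 10.0)]

t_ps = poststratified_total(sample, cell_sizes)
weights = calibration_weights(sample, cell_sizes)
t_weighted = sum(w * y for w, (_, y) in zip(weights, sample))

# Calibration property: weighted cell counts reproduce the population counts N_ij.
weighted_counts = defaultdict(float)
for w, (cell, _) in zip(weights, sample):
    weighted_counts[cell] += w
```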

Chang and Kott (2008) derived the asymptotic properties of a calibrated estimate of
the form

$$\hat{t}_{cal} = \sum_{k \in S} d_k f(z_k' \hat{\beta})\, y_k \qquad (2)$$

where $\hat{\beta}$ minimizes an objective function of the form

$$Q(\beta) = \Big[T_x - \sum_{k \in S} d_k f(z_k' \beta)\, x_k\Big]' V^{-1} \Big[T_x - \sum_{k \in S} d_k f(z_k' \beta)\, x_k\Big]. \qquad (3)$$

In equations (2) and (3), $z$ is a vector of model variables whose length $Q$ is at
most the length $P$ of the benchmark variables $x$, $f$ is a positive real valued
function which Chang and Kott (2008) call the back link function, and $V$ is some
positive definite symmetric $P \times P$ matrix. $V$ is allowed to depend upon
$\beta$, as would occur if $V(\beta)$ is some measurement of the variability of
$\sum_{k \in S} d_k f(z_k' \beta)\, x_k$.

In Chang and Kott (2008), the realized sample $S$ is the respondents from an original
sample with sampling weights $d_k$. The respondent sample $S$ is assumed to be a
Poisson subsample of the original sample with Poisson probabilities
$f(z_k' \beta_0)^{-1}$, for some $\beta_0$. The asymptotic formulas derived there were
under an asymptotic framework for this quasi-randomization (design based) model. We
use the term quasi-randomization to remind ourselves that the assumed Poisson response
mechanism is actually model based.

It should be noted that the use of calibration to correct for nonresponse goes back to
Fuller, Loughin and Baker (1994), at least when $z = x$ and $f(\eta) = 1 + \eta$.

We propose to use the Chang and Kott (2008) methodology with $x$ remaining as indicator variables for the complete $I \times J$ cross classification but letting $z$ be a vector of $I + J - 1$ indicator variables for the marginal classifications. In other words, we propose to rebalance the sample to come as close as possible, in the sense of minimizing (3), to the correct cell proportions in the complete cross classification, but requiring the rebalancing weights to depend only upon the marginal classifications.

The Chang and Kott (2008) framework applies in the presence of nonresponse (and/or noncoverage) if $f(z_k' \beta_0)^{-1}$ is the response (or combined response and coverage) probability. We note that poststratification, a special case of calibration, is often used for the purpose of nonresponse/noncoverage correction. In our test example below, there is no nonresponse or noncoverage to correct for, and hence, the Chang and Kott (2008) framework applies with $\beta_0 = 0$ for any $f$ with $f(0) = 1$. In other words, if the calibration is used solely for the purposes of sample rebalancing, we can use Chang and Kott (2008) with almost any $f$. But if we are trying to correct for nonresponse and/or noncoverage, stronger assumptions are required.

It should be noted that raking is simply the calibrated estimate using the $I + J - 1$ indicator variables of the marginal classifications as both benchmark and model variables and using $f(\eta) = e^{\eta}$. Thus we will also explore the use of this back link function.

Section 2 gives the precise formulas for the estimators 
we will use in this study. Chang and Kott (2008) can be 
applied to derive sample based variance estimators and 
these derivations are given in the Appendix. 

In Section 3, we give the results on an empirical study 
using the 2005 first quarter Quarterly Census of Employ- 
ment and Wages, collected by the Bureau of Labor Statistics. 
We will restrict ourselves to the five states which we will 
denote by A, B, C, D, E and to five industry groupings 
denoted by 1, 2, 3, 4, 5. We will not further identify either 
the states or the industry groupings to prevent identification 
of the outlier in the discussion below. This population has 
283,725 firms. From this population we will take Monte 
Carlo simple random samples of size n = 200, 1,000, 5,000 
and use the double classification of state and industry group. 

It should be noted that 0.18% of the population has the 
double classification of state E and industry grouping 5. 
Thus when n = 200, the expected sample size in this cell is 
0.36 and poststratification using the double classification is 
out of the question. 

Kott and Chang (2010) derive the properties of $\hat t_{z,f,V}$ using a model based framework. The models considered there do not apply with our selection of $x$ and $z$ variables. However, motivated by their approach, we examine in Section 4 the behavior of the estimator $\hat t_{z,f,V}$ defined by equation (2), under highly simplified assumptions, including that $f(\eta) = 1 + \eta$. This leads in Section 5 to the choice of a new weight matrix $V^{-1}$ for use in (3). We then continue with our empirical exploration using this new estimator.


2. Mathematical formulas 


In this section we list the formulas used in this study. 
They are all special cases of formulas in Chang and Kott 
(2008). We assume that a simple random sample of size n 
is taken from a population of size N and we use S and r 
to denote the respondents from that sample and the size of 
$S$. We assume that the calibration weight function has a $\beta_0$ such that $f(z' \beta_0)^{-1}$ is the response probability for an element with model variables $z$. In particular, and without loss of generality, if there is no nonresponse problem, we assume $f(0) = 1$.

The same formulas work with noncoverage, in which case $f(z' \beta_0)^{-1}$ is the combined response/coverage probability.

We denote by $N_{ij}$, $S_{ij}$ and $r_{ij}$ the population size, respondent sample, and respondent sample size in classification $(i, j)$. Although $N_{ij}$ is assumed known, our methodology does not require the knowledge of the row and column classifications of nonrespondents.

We define $N_{i\cdot} = \sum_j N_{ij}$ and analogously define $N_{\cdot j}$.

We will use estimators for a total $T_y$ of the form

$$\hat t_y = \frac{N}{n} \sum_{i,j} \sum_{k \in S_{ij}} w_{ij} y_k \qquad (4)$$

where the adjustment weights $w_{ij}$ are defined as below. These are all special cases of equations (2) and (3) when we use $V = I$.

The calibrated margins estimator uses $f(\eta) = 1 + \eta$ and defines $x = z$ to be $I + J - 1$ independent indicator variables for the marginal categories. In this case $T_x$ is a vector of $N_{i\cdot}$ and $N_{\cdot j}$. The adjustment weights $f(z_k' \hat\beta)$ have the form $w_{ij} = 1 + \hat\beta_{i\cdot} + \hat\beta_{\cdot j}$ when $z_k$ is the vector of indicator variables for membership in the $i$th and $j$th row and column classifications respectively. Since the number of equations (the dimension of $x$) equals the number of unknowns (the dimension of $\beta$), we expect to be able to solve the equations

$$T_x = \sum_{k \in S} d_k f(z_k' \hat\beta) x_k \qquad (5)$$

exactly. Thus $\hat\beta_{i\cdot}$, $\hat\beta_{\cdot j}$ solve the linear equations of rank $I + J - 1$

$$N_{i\cdot} = \frac{N}{n} \sum_j (1 + \hat\beta_{i\cdot} + \hat\beta_{\cdot j}) r_{ij}$$

$$N_{\cdot j} = \frac{N}{n} \sum_i (1 + \hat\beta_{i\cdot} + \hat\beta_{\cdot j}) r_{ij}$$

which easily follows from (5).
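
As an illustration of the calibrated margins equations, the following sketch solves the singular linear system with a minimum-norm least squares solve. The function name `calibrated_margins_weights`, the use of `numpy.linalg.lstsq`, and the toy counts are assumptions made for this example; the paper does not say how the system was solved.

```python
import numpy as np

def calibrated_margins_weights(N_cells, r_cells, n):
    """Weights w_ij = 1 + b_i + c_j that reproduce the row and column totals."""
    N_cells = np.asarray(N_cells, dtype=float)
    r = np.asarray(r_cells, dtype=float)
    I, J = r.shape
    N = N_cells.sum()
    # Unknowns (b_1..b_I, c_1..c_J); both sets of equations are scaled by n / N.
    M = np.zeros((I + J, I + J))
    M[:I, :I] = np.diag(r.sum(axis=1))
    M[:I, I:] = r
    M[I:, :I] = r.T
    M[I:, I:] = np.diag(r.sum(axis=0))
    rhs = np.concatenate([
        (n / N) * N_cells.sum(axis=1) - r.sum(axis=1),
        (n / N) * N_cells.sum(axis=0) - r.sum(axis=0),
    ])
    # The system has rank I + J - 1, so take the minimum-norm solution.
    coef, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    b, c = coef[:I], coef[I:]
    return 1.0 + b[:, None] + c[None, :]

# Toy 2 x 3 example with one empty sample cell (made-up counts).
N_cells = np.array([[400.0, 300.0, 50.0], [200.0, 100.0, 30.0]])
r_cells = np.array([[38.0, 25.0, 0.0], [22.0, 12.0, 3.0]])
n = r_cells.sum()                      # no nonresponse in this toy example
w = calibrated_margins_weights(N_cells, r_cells, n)
fitted = (N_cells.sum() / n) * w * r_cells
```

The row and column sums of `fitted` match the population margins even though one cell of the sample is empty.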

The calibrated cell counts estimator uses $f(\eta) = 1 + \eta$ and defines $x$ to be the $IJ$ indicator variables for the complete cross classification and $z$ to be $I + J - 1$ independent indicator variables for the marginal categories. In this case $T_x$ is a vector of $N_{ij}$ and, since $V = I$, the adjustment weights $w_{ij} = 1 + \hat\beta_{i\cdot} + \hat\beta_{\cdot j}$ minimize the objective function

$$\sum_{i,j} \Big[ N_{ij} - \frac{N}{n} (1 + \beta_{i\cdot} + \beta_{\cdot j}) r_{ij} \Big]^2.$$

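The raking estimator uses $f(\eta) = e^{\eta}$ and defines $x = z$ to be $I + J - 1$ independent indicator variables for the marginal categories. Its adjustment weights are $w_{ij} = \exp(\hat\beta_{i\cdot} + \hat\beta_{\cdot j})$ where $\hat\beta_{i\cdot}$, $\hat\beta_{\cdot j}$ solve the $I + J$ equations

$$N_{i\cdot} = \frac{N}{n} \sum_j \exp(\hat\beta_{i\cdot} + \hat\beta_{\cdot j}) r_{ij}$$

$$N_{\cdot j} = \frac{N}{n} \sum_i \exp(\hat\beta_{i\cdot} + \hat\beta_{\cdot j}) r_{ij}.$$

Since $\sum_i N_{i\cdot} = N = \sum_j N_{\cdot j}$, these $I + J$ equations yield only $I + J - 1$ constraints. It should be noted, however, that if a constant $c$ is added to each $\hat\beta_{i\cdot}$ and subtracted from each $\hat\beta_{\cdot j}$, the $w_{ij}$ are not changed.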
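
The raking equations are commonly solved by iterative proportional fitting, which alternately rescales rows and columns. The sketch below uses that standard algorithm as an assumption; the paper does not state which solver was used, and the function name, tolerance and toy counts are hypothetical.

```python
import numpy as np

def raking_weights(N_cells, r_cells, n, tol=1e-10, max_iter=1000):
    """Weights w_ij = a_i * b_j = exp(beta_i. + beta_.j) matching both margins."""
    N_cells = np.asarray(N_cells, dtype=float)
    r = np.asarray(r_cells, dtype=float)
    d = N_cells.sum() / n
    row_totals = N_cells.sum(axis=1)
    col_totals = N_cells.sum(axis=0)
    a = np.ones(r.shape[0])
    b = np.ones(r.shape[1])
    for _ in range(max_iter):
        a = row_totals / (d * (r * b[None, :]).sum(axis=1))   # match row totals
        b = col_totals / (d * (r * a[:, None]).sum(axis=0))   # match column totals
        fitted_rows = (d * a[:, None] * b[None, :] * r).sum(axis=1)
        if np.abs(fitted_rows - row_totals).max() < tol * row_totals.max():
            break
    return a[:, None] * b[None, :]

# Toy 2 x 3 example with one empty sample cell (made-up counts).
N_cells = np.array([[400.0, 300.0, 50.0], [200.0, 100.0, 30.0]])
r_cells = np.array([[38.0, 25.0, 0.0], [22.0, 12.0, 3.0]])
n = r_cells.sum()
w = raking_weights(N_cells, r_cells, n)
fitted = (N_cells.sum() / n) * w * r_cells
```

Multiplying every `a` by a constant and dividing every `b` by it leaves `w` unchanged, which is the multiplicative form of the indeterminacy noted above.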

The exponential calibrated cell counts estimator uses $f(\eta) = e^{\eta}$ and defines $x$ to be the $IJ$ indicator variables for the complete cross classification and $z$ to be $I + J - 1$ independent indicator variables for the marginal categories. Its adjustment weights $w_{ij} = \exp(\hat\beta_{i\cdot} + \hat\beta_{\cdot j})$ minimize the objective function

$$\sum_{i,j} \Big[ N_{ij} - \frac{N}{n} \exp(\beta_{i\cdot} + \beta_{\cdot j}) r_{ij} \Big]^2.$$

Chang and Kott (2008) give formulas for sample based estimation of the variance of $\hat t_{z,f,V}$. In the appendix, we apply these formulas to the four estimators above.


3. Empirical study 


The population we use here is the data from the 2005 
first quarter Quarterly Census of Employment and Wages 
(QCEW), restricted to five states and five industry 
groupings. The QCEW is compiled from mandatory reports 
to state employment offices and hence is virtually a census, and the data we used is the complete QCEW for these five states and five industry groupings. This population has $N = 283,725$ firms, divided as in Table 1.

The response variables $y$ are total employment and total (quarterly) wages. For these variables $T_y = 2,981,364$ for total employment and $T_y = 2,334,400$ (in tens of thousands of dollars) for total wages. In this study, we took 10,000 samples of sizes $n = 200$, 1,000, 5,000. For each of the 4 estimators, we report the estimated bias, standard error, and root mean square error. We also report the square root of the mean of the estimated variances using the first term of equation (15). For purposes of comparison, we report the theoretical and empirical values for the unweighted estimator $(N/n) \sum_{k \in S} y_k$. These results are reported in Table 2 for total employment and Table 3 for total wages.

For sample size $n = 5,000$, the expected sample size in the smallest cell (state E and industry group 5) is 9.07. While this might be a little small for poststratification, the probability that this cell has a sample size less than 2, the minimum size necessary for variance estimation, is 0.0011. In our simulations 9 runs had a cell with sample size less than 2. For this sample size, we also report the empirical behavior of the poststratified estimator, excluding the 9 problem cases, using the variance estimate (7.6.5) of Särndal, Swensson and Wretman (1992) and its theoretical behavior using the variance approximation given in (7.6.6) of Särndal et al. (1992).
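
The expected count and the small-cell probability quoted above can be checked with a hypergeometric calculation. In this sketch the cell size 515 is an assumption: it is the state E, industry group 5 count implied by the row and column sums of Table 1. The helper names are hypothetical.

```python
from math import lgamma, exp

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def prob_cell_count_below(N, K, n, m):
    """P(X < m) when X counts cell members in a simple random sample
    of size n drawn without replacement (hypergeometric distribution)."""
    denom = log_comb(N, n)
    return sum(exp(log_comb(K, x) + log_comb(N - K, n - x) - denom)
               for x in range(m))

N = 283_725      # population size
K = 515          # assumed size of the state E, industry group 5 cell
n = 5_000        # sample size
expected_count = n * K / N
p_small = prob_cell_count_below(N, K, n, 2)
```

With these inputs the expected count is about 9.08 and the probability of fewer than 2 sampled units is close to 0.0011, so roughly 11 of 10,000 simulated samples would be expected to have this problem, in line with the 9 observed.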


Table 1
Business entities by state and industry group

                                industry group
state          1          2          3          4          5        sum
A          5,986      5,548      7,712      3,969      1,299     24,514
         (2.11%)    (1.96%)    (2.72%)    (1.40%)    (0.46%)    (8.64%)
B         18,782     31,572     22,012      4,982      4,504     81,852
         (6.62%)   (11.13%)    (7.76%)    (1.76%)    (1.59%)   (28.85%)
C         13,518     13,099     17,837      5,610      3,001     53,065
         (4.76%)    (4.62%)    (6.29%)    (1.98%)    (1.06%)   (18.70%)
D         30,428     36,017     32,541     10,963      5,399    115,348
        (10.72%)   (12.69%)   (11.47%)    (3.86%)    (1.90%)   (40.65%)
E          2,225      2,020      3,110      1,076        515      8,946
         (0.78%)    (0.71%)    (1.10%)    (0.38%)    (0.18%)    (3.15%)
sum       70,939     88,256     83,212     26,600     14,718    283,725
        (25.00%)   (31.11%)   (29.33%)    (9.38%)    (5.19%)


Table 2
Empirical comparison of 4 estimators of total employment

estimator                                      bias      st. err.       rt. MSE  rt. est. var.
n = 200
unweighted (theoretical)                          0   [illegible]
unweighted (empirical)                       -1,280     1,068,944     1,068,945      1,059,463
cal. margins                                 -1,394     1,105,201     1,105,201      1,048,873
cal. cell cts.                             -218,751     1,008,436     1,031,889        975,140
raking                                         -462   [illegible]   [illegible]      1,041,490
exp. cal. cell cts.                        -227,578     1,000,154     1,025,719        962,153
n = 1,000
unweighted (theoretical)                          0       497,144
unweighted (empirical)                       -5,435       505,941       505,970        501,144
cal. margins                                 -6,212       506,239       506,277        498,946
cal. cell cts.                              -56,118       493,611       496,790        488,222
raking                                       -4,854       507,938       507,961        499,237
exp. cal. cell cts.                         -58,891       492,939       496,445        487,281
n = 5,000
unweighted (theoretical)                          0   [illegible]
unweighted (empirical)                        1,516       224,088       224,093        222,034
poststr. (theoretical)                            0       220,315
poststr. (empirical, 9 cases excluded)        1,234   [illegible]       223,228        221,094
cal. margins                                  1,649       223,091       223,098        220,833
cal. cell cts.                               -8,606       222,170   [illegible]        220,347
raking                                        3,682       236,355       236,383        220,606
exp. cal. cell cts.                         -10,643       223,472   [illegible]        220,207


Table 3
Empirical comparison of 4 estimators of total wages (tens of thousands of dollars)

estimator                                      bias      st. err.       rt. MSE  rt. est. var.
n = 200
unweighted (theoretical)                          0     1,682,571
unweighted (empirical)                      -11,119     1,551,186   [illegible]      1,543,483
cal. margins                                -11,474     1,582,383     1,582,425      1,510,413
cal. cell cts.                             -214,323     1,451,931     1,467,664      1,413,411
raking                                      -11,220     1,579,842     1,579,882      1,501,170
exp. cal. cell cts.                        -221,435     1,438,810     1,455,750      1,393,246
n = 1,000
unweighted (theoretical)                          0       751,406
unweighted (empirical)                       -2,911       772,495       772,501        768,878
cal. margins                                 -4,372       776,955       776,968        768,869
cal. cell cts.                              -51,649       756,201       757,963        751,384
raking                                       -4,684       778,302       778,316        769,428
exp. cal. cell cts.                         -54,305       754,963       756,913        749,832
n = 5,000
unweighted (theoretical)                          0       333,654
unweighted (empirical)                        2,678       336,057       336,068        337,239
poststr. (theoretical)                            0   [illegible]
poststr. (empirical, 9 cases excluded)        1,802       335,271       335,276        336,192
cal. margins                                  2,510       334,910       334,920        336,064
cal. cell cts.                               -7,149       333,560       333,637        335,006
raking                                       -4,679       339,074       339,106        335,230
exp. cal. cell cts.                          -9,251       334,365       334,493        334,755


The response variables, total employment and total wages, are strongly skewed right. There is one firm (in state C and industry group 4) whose total employment is more than double the total employment of the next largest firm and many hundreds of times the mean employment of the remaining firms. We repeat this study using a population with this firm removed. The results are presented in Tables 4 and 5. In practice with this population, this firm would normally be sampled with certainty (a self representing unit) and samples constructed from the remaining firms. Thus Tables 4 and 5 are perhaps more indicative of the relative performance of these estimators in actual practice.

The samples used for Tables 4 and 5 are identical to those used for Tables 2 and 3 except that if the outlier was included in the sample, it was replaced by a new observation from the population. This was done to improve the comparability of the results of Tables 4 and 5 with those of Tables 2 and 3.


Table 4
Empirical comparison of 4 estimators of total employment: population with outlier removed

estimator                                      bias      st. err.       rt. MSE  rt. est. var.
n = 200
unweighted (theoretical)                          0       950,688
unweighted (empirical)                        5,395       975,617       975,632        965,448
cal. margins                                  5,711     1,019,583     1,019,599        963,314
cal. cell cts.                             -211,568       909,070       933,365        877,343
raking                                        6,688     1,018,383     1,018,405        956,867
exp. cal. cell cts.                        -217,810       902,756       928,660        868,797
n = 1,000
unweighted (theoretical)                          0       424,552
unweighted (empirical)                       -8,393       422,116       422,199        414,019
cal. margins                                 -9,430       418,153       418,259        408,577
cal. cell cts.                              -58,808       408,391       412,603        399,961
raking                                       -8,135       419,938       420,016        408,611
exp. cal. cell cts.                         -61,014       407,780       412,320        399,311
n = 5,000
unweighted (theoretical)                          0       188,517
unweighted (empirical)                          702       191,631       191,632        188,089
poststr. (theoretical)                            0       187,691
poststr. (empirical, 9 cases excluded)          563       190,854       190,855        187,180
cal. margins                                    820       190,662       190,664        186,664
cal. cell cts.                               -9,376       189,884       190,115        186,202
raking                                        2,933       205,924       205,944        186,618
exp. cal. cell cts.                          -9,922       189,813       190,072        186,140


Table 5
Empirical comparison of 4 estimators of total wages: population with outlier removed

estimator                                      bias      st. err.       rt. MSE  rt. est. var.
n = 200
unweighted (theoretical)                          0     1,330,930
unweighted (empirical)                          711     1,341,900     1,341,901      1,334,556
cal. margins                                  1,256     1,387,484     1,387,485      1,318,285
cal. cell cts.                             -201,575     1,225,852     1,242,314      1,194,071
raking                                        1,473     1,386,978     1,386,979    [illegible]
exp. cal. cell cts.                        -206,956     1,217,881     1,235,340      1,184,166
n = 1,000
unweighted (theoretical)                          0       594,370
unweighted (empirical)                       -8,169   [illegible]       587,832        582,524
cal. margins                                -10,093       583,606       583,693        576,251
cal. cell cts.                              -56,429       569,158       571,948        563,022
raking                                      -10,529       584,532       584,626        576,282
exp. cal. cell cts.                         -58,435       568,277   [illegible]        562,061
n = 5,000
unweighted (theoretical)                          0       263,923
unweighted (empirical)                        1,185       266,779       266,782        264,110
poststr. (theoretical)                            0       263,339
poststr. (empirical, 9 cases excluded)          566       265,973       265,973        263,210
cal. margins                                    991       265,449       265,451        262,556
cal. cell cts.                               -8,565       264,126       264,265        261,483
raking                                       -6,008       271,535       271,602        262,021
exp. cal. cell cts.                          -9,070       264,038       264,194        261,394


Examining Tables 2 and 3, we see that the $P > Q$ methods, that is those that calibrate the cross classified cell counts using calibration weights which depend upon the marginal classifications, are clearly more biased than the other techniques. However, the biases of these estimators relative to their standard deviations decrease with increasing sample sizes. We will show in the next section that, under a highly simplified model, the bias has order $n^{-1}$ and the standard deviation has order $n^{-1/2}$. Consider, for example, the results for the “calibrated cell counts” estimator in Table 2. In this case, the absolute bias divided by the standard error is 0.217, 0.114, 0.039 for $n = 200$, 1,000, 5,000 respectively. For these values of $n$, the values of $n^{-1/2}$ are 0.0707, 0.0316, 0.0141 and it appears that the former series of three ratios is approximately 3 times the latter series.
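
The ratios quoted in this paragraph can be reproduced directly from the Table 2 entries for the calibrated cell counts estimator. The variable names below are illustrative.

```python
# (bias, st. err.) for the calibrated cell counts estimator in Table 2
table2_cal_cell = {
    200: (-218_751, 1_008_436),
    1_000: (-56_118, 493_611),
    5_000: (-8_606, 222_170),
}

ratios = {n: abs(bias) / se for n, (bias, se) in table2_cal_cell.items()}
inv_root_n = {n: n ** -0.5 for n in table2_cal_cell}
multiples = {n: ratios[n] / inv_root_n[n] for n in table2_cal_cell}
```

Each entry of `multiples` lies roughly between 2.7 and 3.6, which is the sense in which the bias to standard error ratio behaves like a constant times $n^{-1/2}$.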

It also appears that the exponential back link function $f$ performs slightly better than the linear choice for $f$. Computationally the former is much more expensive than the latter. We also notice that as the sample sizes increase, the estimators’ performances appear to converge. This is to be expected: because there is no nonresponse, as $n \to \infty$, $\hat\beta \to 0$, so that the adjustment weights $w = f(z' \hat\beta) \to 1$.

Comparing the linear calibrated cell counts estimator to 
the empirical values of the unweighted estimator, the former 
is approximately 7.3% more efficient in MSE when n = 
200 for total employment and 11.7% more efficient for total 
wages. (This means, for example, that the empirical MSE of 
the unweighted estimator is 1.117 times the empirical MSE 
for the linear calibrated cell counts estimator when 
estimating total wages.) For the exponential calibrated cell 
counts estimator, the improvement in efficiency relative to 
the empirical MSE of the unweighted estimator is 8.6% for 
total employment and 13.5% for total wages. Comparison to 
the theoretical values for the unweighted estimator would be 
more favorable to the calibrated cell counts estimators, but 
we will use the empirical results for the unweighted esti- 
mator as the various estimators have all used the same 
Monte Carlo samples. The calibrated cell counts estimator 
and exponential calibrated cell counts estimator still have an 
advantage in MSE over the unbiased estimator at sample 
size n = 1,000. 
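
The efficiency figures above are ratios of mean square errors, that is, squared ratios of the root MSE columns. A short check for total employment at $n = 200$, using the Table 2 entries (the function name is illustrative):

```python
def efficiency_gain(rt_mse_reference, rt_mse_estimator):
    """Relative MSE gain of an estimator over a reference estimator."""
    return (rt_mse_reference / rt_mse_estimator) ** 2 - 1.0

# Table 2, n = 200: unweighted (empirical) against the two cell count estimators
gain_linear = efficiency_gain(1_068_945, 1_031_889)
gain_exponential = efficiency_gain(1_068_945, 1_025_719)
```

These evaluate to about 0.073 and 0.086, matching the 7.3% and 8.6% quoted for total employment.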

When the single extreme outlier is removed, leaving 
283,724 remaining elements of the population, the 
calibrated cell count estimators have somewhat better 
performance relative to the unweighted estimator. For n = 
200, the linear calibrated cell count estimator offers a 9.3% 
improvement in efficiency for total employment and a 
16.7% improvement for total wages. The comparable ratios 
for the exponential calibrated cell count estimator are 10.4% 
for total employment and 18.0% for total wages. 

Finally, the variance estimator in equation (15) has a 
slight downward bias. 


4. Model based bias and variance
of calibrated estimators


Kott and Chang (2010) derived the asymptotic properties of $\hat t_{z,f,V}$ under a different, model-based, probability structure. In that paper $S$ is a sample selected with selection probabilities $d_k^{-1}$ so that nonresponse is not an issue in $S$. Rather, if $P$, the number of benchmark variables $x$, equals $Q$, the number of model variables $z$, Kott and Chang (2010) assume a prediction model

$$y_k = x_k' \theta + \varepsilon_k, \quad k \in U. \qquad (6)$$

Here $\theta$ is an unknown fixed vector, $\varepsilon_k$ are model independent errors subject to

$$E(\varepsilon_k \mid \{x_g, z_g, I_g;\ g \in U\}) = 0, \qquad (7)$$

and $I_k$ is a random variable defined by $I_k = 1$ if $k \in S$ and $I_k = 0$ otherwise.

When $P > Q$, the model equation (6) must be replaced by

$$y_k = (A_\infty x_k)' \theta + \varepsilon_k, \quad k \in U \qquad (8)$$

for some limiting $Q \times P$ matrix $A_\infty$ (which is defined in a suitable asymptotic framework, see Kott and Chang (2010)). Thus when $x$ represents indicator variables for the complete $I \times J$ cross classification, we have that $x_k' \theta$, for $k$ in the $(i, j)$th classification, is the mean value of the response variable over the $(i, j)$th classification. Hence, by definition, $E(\varepsilon_k \mid x_g, g \in U) = 0$ and, since $z$ is a function of $x$, the model (6) and (7) automatically holds when the sampling (including nonresponse) is noninformative.

However, in our application of calibration, $P = IJ > Q = I + J - 1$ and the model equation (8) has no a priori reason to hold.

Motivated by Kott and Chang (2010) we examine the 
behavior of calibrated estimates under the following 
scenario: 

1. The benchmark variables $x$ are indicator variables for some partition of the population into classes $C_r$. The model (6) automatically holds where the $r$th component of $\theta$ is the population mean of $C_r$. Let $f_r$ denote the proportion of the population in $C_r$ and $V_r = \mathrm{Var}(\varepsilon_k \mid k \in C_r)$. We shall also use the notation $\mathrm{Var}(x_k)$ for $V_r$ when $k \in C_r$.

2. The sample is a simple random sample of size $n$ chosen with replacement.

3. The back link function $f(\eta)$ in the estimator $\hat t_{z,f,V}$ of equation (2) is $f(\eta) = 1 + \eta$.


Although these assumptions are unrealistic in practice, the main purpose of this section is to heuristically justify a choice, given in the next section, for the matrix $V$. At this point, we no longer place any requirements on $z$.

We note that in this situation $E(\varepsilon_k \mid x_g, g \in U) = 0$. Note that (7) will hold if the components of the model variables $z$ are functions of $x$, that is each component of $z$ is constant on each class. However if $P > Q$, (8) will generally not hold. In any event, in this section we require neither (7) nor (8).

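We let

$$\mu_{xz} = N^{-1} \sum_{k \in U} x_k z_k'$$

and the matrix $A_\infty$ of equation (8) becomes

$$A_\infty = \mu_{xz}' V^{-1}.$$

Let $\hat\mu_{z,V} = N^{-1} \hat t_{z,V}$ where $\hat t_{z,V}$ is defined as in (2). We have suppressed the $f$ in the notation $\hat t_{z,V}$ because, in this section, $f(\eta) = 1 + \eta$. Letting $\bar y_S$ and $\bar x_S$ denote the indicated sample means, $\mu_y$ and $\mu_x$ the corresponding population means, and using Kott and Chang (2010),

$$\hat\mu_{z,V} = \bar y_S + (\mu_x - \bar x_S)' A_\infty' (A_\infty V A_\infty')^{-1} \Big( \frac{1}{n} \sum_{j \in S} z_j y_j \Big) + O_p(n^{-1})$$

$$= \bar y_S + (\mu_x - \bar x_S)' A_\infty' (A_\infty V A_\infty')^{-1} \mu_{xz}' \theta + O_p(n^{-1})$$

$$= \bar y_S + (\mu_x - \bar x_S)' V^{-1} \mu_{xz} (\mu_{xz}' V^{-1} \mu_{xz})^{-1} \mu_{xz}' \theta + O_p(n^{-1}). \qquad (9)$$

If $\hat\mu_{z,V}$ is bounded, as would occur if $f(\eta) = 1 + \eta$ were modified for large $\eta$ to prevent large calibration weight adjustments, we would have

$$E(\hat\mu_{z,V}) = E(\bar y_S) + O(n^{-1}) = \mu_y + O(n^{-1})$$

$$E(\hat\mu_{z,V} \mid X_S) = \bar x_S' \theta + (\mu_x - \bar x_S)' P_{z,V}' \theta + O(n^{-1})$$

$$\mathrm{Var}[E(\hat\mu_{z,V} \mid X_S)] = \frac{1}{n} \theta' (I - P_{z,V}) \Sigma_x (I - P_{z,V})' \theta + o(n^{-1}),$$

where $\Sigma_x$ is the covariance matrix of $x$ and

$$P_{z,V} = \mu_{xz} (\mu_{xz}' V^{-1} \mu_{xz})^{-1} \mu_{xz}' V^{-1}. \qquad (10)$$

Now

$$\mathrm{Var}(\hat\mu_{z,V} \mid X_S) = \mathrm{Var}(\bar y_S \mid X_S) + o(n^{-1}) = \frac{1}{n^2} \sum_{j \in S} \mathrm{Var}(x_j) + o(n^{-1})$$

$$E[\mathrm{Var}(\hat\mu_{z,V} \mid X_S)] = E[\mathrm{Var}(\bar y_S \mid X_S)] + o(n^{-1}) = \frac{1}{n} \sum_r f_r V_r + o(n^{-1}).$$

It is easily seen that

$$\mathrm{Var}[E(\bar y_S \mid X_S)] = \frac{1}{n} \theta' \Sigma_x \theta.$$

Since $\mathrm{Var}(\hat\mu_{z,V}) = \mathrm{Var}[E(\hat\mu_{z,V} \mid X_S)] + E[\mathrm{Var}(\hat\mu_{z,V} \mid X_S)]$ and similarly for $\mathrm{Var}(\bar y_S)$, $\mathrm{Var}(\hat\mu_{z,V}) \le \mathrm{Var}(\bar y_S)$ to terms $o(n^{-1})$ when

$$\theta' (I - P_{z,V}) \Sigma_x (I - P_{z,V})' \theta \le \theta' \Sigma_x \theta. \qquad (11)$$

The derivation also establishes that the square bias has an asymptotically trivial contribution to the mean square error of $\hat\mu_{z,V}$.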


5. A proposed new weight matrix $V^{-1}$


In this section we return to our original benchmark $x$ and model $z$ variables. When $V = I$, the identity matrix, we see from (10) that $P_{z,I} \theta = P_{z,I}' \theta$ is the projection of $\theta$ onto the span of the columns of $\mu_{xz}$. The left hand side of (11) will be zero if $\theta$ is in this column span.

For simplicity, we will write $\mu_{xz}$ as a singular matrix, of rank $I + J - 1$, with one row for each possible double classification cell $(i, j)$ and one column for each row classification $i$ and each column classification $j$. Thus, the $(i, j)$th row of $\mu_{xz}$ has $f_{ij} = N_{ij}/N$ in the columns corresponding to $i$ and $j$ and zero elsewhere. Thus $\theta$ will be in the column span of $\mu_{xz}$ if and only if for each $i$ and $j$

$$\frac{\theta_{ij}}{f_{ij}} = \alpha_i + \beta_j \qquad (12)$$

for some $\alpha_i$ and $\beta_j$. In other words, the $\theta_{ij}/f_{ij}$ satisfy a two way ANOVA model without interactions in the column and row classifications.

Recalling that $\theta_{ij}$ represents the mean value of the variable of interest $y$ in the $(i, j)$th cell, (12) does not appear to be a very promising approximation to the truth. A more likely approximation would be the usual two way ANOVA model

$$\theta_{ij} = \alpha_i + \beta_j. \qquad (13)$$

Suppose we change variables $\tilde x = C x$ for some diagonal matrix $C$. Note that the rows and columns of $C$ are doubly indexed by $(i, j)$ and we will let $c_{ij}$ denote the diagonal entry in the $(i, j)$th row and column. Let $\tilde\theta = C^{-1} \theta$ so that model (6) can be rewritten as

$$y_k = \tilde x_k' \tilde\theta + \varepsilon_k.$$

Now the matrix $\mu_{\tilde x z}$ has $c_{ij} f_{ij}$ in the $(i, j)$th row and the columns corresponding to $i$ and $j$. Now $\tilde\theta$ will be in the column span of $\mu_{\tilde x z}$ if and only if

$$c_{ij}^{-1} \theta_{ij} = \tilde\theta_{ij} = c_{ij} f_{ij} (\alpha_i + \beta_j).$$

Thus (13) is equivalent to $c_{ij} = f_{ij}^{-1/2}$. It is easily checked that

$$\tilde\theta' (I - \tilde P_{z,I}) \Sigma_{\tilde x} (I - \tilde P_{z,I})' \tilde\theta = \theta' (I - P_{z,V}) \Sigma_x (I - P_{z,V})' \theta$$

when $V = C^{-2}$, where $\tilde P_{z,I}$ denotes the matrix (10) formed from $\tilde x$ with the identity weight matrix. We thus propose using the diagonal matrix $V_1$ whose diagonal entries are $f_{ij}$.
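
The role of the weight matrix can be illustrated numerically. The sketch below builds $\mu_{xz}$ for a small cross classification, forms the matrix in (10), and evaluates the left hand side of (11) for cell means that follow the additive model (13). Everything here is an assumption made for illustration: the function names, the cell proportions, the cell means, and the use of a pseudo-inverse to handle the singular parameterization of $\mu_{xz}$.

```python
import numpy as np

def marginal_design(I, J):
    """One row per cell (i, j); columns are I row indicators then J column indicators."""
    Z = np.zeros((I * J, I + J))
    for i in range(I):
        for j in range(J):
            Z[i * J + j, i] = 1.0
            Z[i * J + j, I + j] = 1.0
    return Z

def projection_matrix(f, Z, V):
    """P = mu (mu' V^-1 mu)^+ mu' V^-1 with mu = diag(f) Z (pseudo-inverse assumed)."""
    mu = f[:, None] * Z
    Vinv = np.linalg.inv(V)
    return mu @ np.linalg.pinv(mu.T @ Vinv @ mu) @ mu.T @ Vinv

def lhs_of_11(theta, f, P):
    """theta' (I - P) Sigma_x (I - P)' theta for cell-indicator benchmark variables."""
    Sigma = np.diag(f) - np.outer(f, f)      # covariance of the indicator vector x
    resid = (np.eye(len(f)) - P).T @ theta
    return float(resid @ Sigma @ resid)

I, J = 2, 3
f = np.array([0.30, 0.20, 0.10, 0.15, 0.15, 0.10])     # hypothetical cell proportions
Z = marginal_design(I, J)
theta = Z @ np.array([10.0, 20.0, 0.0, 5.0, 9.0])      # additive cell means, as in (13)

P_identity = projection_matrix(f, Z, np.eye(I * J))
P_weighted = projection_matrix(f, Z, np.diag(f))        # V_1 = diag(f_ij)
lhs_identity = lhs_of_11(theta, f, P_identity)
lhs_weighted = lhs_of_11(theta, f, P_weighted)
```

For these additive cell means the quantity is numerically zero with $V_1 = \mathrm{diag}(f_{ij})$ but strictly positive with $V = I$, which is the motivation for the proposed weight matrix.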
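
With this choice of $V_1$, equation (9) suggests the estimator for simple random sampling

$$\hat t_{z,V_1} = N \bar y_S + (T_x - N \bar x_S)' V_1^{-1} \hat\mu_{xz} (\hat\mu_{xz}' V_1^{-1} \hat\mu_{xz})^{-1} \Big\{ \frac{1}{n} \sum_{k \in S} z_k y_k \Big\} \qquad (14)$$

where $\hat\mu_{xz} = n^{-1} \sum_{k \in S} x_k z_k'$. In our case both $\mu_{xz}$ and $\mu_x$ are known from the $N_{ij}$, but in the spirit of ratio estimation it is preferable to use $\hat\mu_{xz}$ in place of $\mu_{xz}$. This heuristic observation has been demonstrated using simulations (not shown) with the QCEW population.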
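
A minimal sketch of how an estimator of the form (14) might be computed from unit-level data follows. It is an illustration under stated assumptions, not the author's code: the function name and arguments are hypothetical, $z$ is parameterized with all $I + J$ marginal indicators, and a pseudo-inverse replaces the inverse because that parameterization is singular.

```python
import numpy as np

def weighted_cal_cell_counts(y, cell_i, cell_j, N_cells):
    """Estimate of the total T_y in the form N*ybar + (T_x - N*xbar)' V1^-1 mu (mu' V1^-1 mu)^+ zy."""
    y = np.asarray(y, dtype=float)
    cell_i = np.asarray(cell_i, dtype=int)
    cell_j = np.asarray(cell_j, dtype=int)
    N_cells = np.asarray(N_cells, dtype=float)
    I, J = N_cells.shape
    n, N = len(y), N_cells.sum()
    X = np.zeros((n, I * J))                     # cell indicators x_k
    Z = np.zeros((n, I + J))                     # marginal indicators z_k
    X[np.arange(n), cell_i * J + cell_j] = 1.0
    Z[np.arange(n), cell_i] = 1.0
    Z[np.arange(n), I + cell_j] = 1.0
    V1_inv = np.diag(N / N_cells.ravel())        # V_1 = diag(f_ij), f_ij = N_ij / N
    mu_hat = X.T @ Z / n                         # sample version of mu_xz
    zy = Z.T @ y / n                             # (1/n) * sum of z_k y_k
    gap = N_cells.ravel() - N * X.mean(axis=0)   # T_x - N * xbar_S
    adjustment = gap @ V1_inv @ mu_hat @ np.linalg.pinv(mu_hat.T @ V1_inv @ mu_hat) @ zy
    return float(N * y.mean() + adjustment)

# Toy check: a sample whose cell proportions equal the population proportions.
N_cells = np.array([[40.0, 20.0], [30.0, 10.0]])
cell_i = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
cell_j = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
y = [3.0, 5.0, 4.0, 6.0, 9.0, 7.0, 2.0, 1.0, 3.0, 8.0]
estimate = weighted_cal_cell_counts(y, cell_i, cell_j, N_cells)
```

When the sample is perfectly balanced, as in this toy case, the adjustment term vanishes and the estimate reduces to the unweighted estimator $N \bar y_S$.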

We shall call the estimator $\hat t_{z,V_1}$ of equation (14) the weighted calibrated cell counts estimator.

Simulations with artificial response variables $y$, also not shown, demonstrate that when the model (13) holds, the weighted calibrated cell counts estimator $\hat t_{z,V_1}$ performs markedly better than the other estimators considered here. Table 6 gives statistics for the estimator $\hat t_{z,V_1}$ for the populations and variables studied in Tables 2 - 5.

Comparing to Tables 2 - 5, we see that in all cases $\hat t_{z,V_1}$ has the highest bias but the lowest MSE of the estimators considered. For $n = 200$ and the full population, $\hat t_{z,V_1}$ has a 14.8% gain in efficiency (as measured by MSE) relative to the empirical results for the unbiased estimator when estimating total employment and a 21.1% efficiency gain when estimating total wages. For $n = 200$ and the population with a single extreme outlier deleted, the corresponding gains are 14.2% and 21.7% for total employment and total wages respectively.

The Associate Editor suggested that we compare our estimators to a poststratified estimator using collapsed cells to avoid the problem of empty cells in the sample. We explored this question for sample size $n = 200$ where it is most likely that empty cells will occur. We constructed 14 poststrata. Nine of these poststrata are the nine largest cells in the original data. The other 5 poststrata are A1 and A2; A3, A5, and B4; A4, B5, and C4; C5 and D4; and all cells from state E together with D5. After these combinations, the 5 combined poststrata had sizes that ranged between 4.07% and 5.06% of the population and the 9 retained original cells had sizes in the range of 4.62% to 11.47%.
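
The sizes of the five combined poststrata can be recomputed from the Table 1 cell counts. The counts below are assumptions in the sense that several of them are only recoverable from the row and column sums of Table 1; the dictionary layout and names are illustrative.

```python
N = 283_725
# Cell counts keyed by state letter and industry group, as read from Table 1
# (entries that are unclear in the table are inferred from its margins).
cells = {
    "A1": 5_986, "A2": 5_548, "A3": 7_712, "A4": 3_969, "A5": 1_299,
    "B4": 4_982, "B5": 4_504, "C4": 5_610, "C5": 3_001,
    "D4": 10_963, "D5": 5_399,
    "E1": 2_225, "E2": 2_020, "E3": 3_110, "E4": 1_076, "E5": 515,
}
combined = [
    ["A1", "A2"],
    ["A3", "A5", "B4"],
    ["A4", "B5", "C4"],
    ["C5", "D4"],
    ["E1", "E2", "E3", "E4", "E5", "D5"],
]
shares = [100.0 * sum(cells[c] for c in group) / N for group in combined]
min_expected_count = 200 * min(shares) / 100.0
```

The shares come out between about 4.07% and 5.06%, so with $n = 200$ the smallest combined poststratum has an expected sample count slightly above eight.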


Table 6
Empirical statistics for $\hat t_{z,V_1}$ of equation (14)

     n          bias      st. err.       rt. MSE  rt. est. var.
Full population - total employment
   200      -244,749       967,066       997,556        923,492
 1,000       -64,839       490,758       495,023        483,550
 5,000       -10,767       221,702       221,964        219,408
Full population - total wages
   200      -242,528     1,388,489     1,409,511      1,333,793
 1,000       -62,091       752,603       755,160        744,315
 5,000        -9,821       332,682       332,827        333,702
Population with outlier deleted - total employment
   200      -236,812       881,844       913,088        842,191
 1,000       -67,468       405,215       410,793        396,105
 5,000       -11,482       189,501       189,848        185,483
Population with outlier deleted - total wages
   200      -228,441     1,194,922     1,216,562      1,151,417
 1,000       -66,765       565,008       568,939        557,676
 5,000       -11,138       263,699       263,934        260,768


Unfortunately, the author no longer has access to the 
QCEW data base. Besides the cell counts in Table 1, the 
author has only the means, standard deviations, and 
maximum values by cell. The author constructed a pseudo 
population using the squares of randomly generated gamma 
variables. The square gamma variables were constructed to 
have the same cell means and standard deviations as the cell 
means and standard deviations in the original data. After 
doing this, the square gamma variables were rounded 
upwards to integer values. For these pseudo populations, 
$T_y = 3,149,491$ for employment and 2,305,273, in tens of
thousands of dollars, for wages.
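
The paper does not say how the gamma parameters were chosen. One possibility, sketched here purely as an assumption, is moment matching: if $X$ is gamma with shape $k$ and scale $s$, then $E(X^2) = s^2 k(k+1)$ and $\mathrm{Var}(X^2) = s^4 k(k+1)(4k+6)$, so the squared coefficient of variation of $X^2$ fixes $k$ and the mean then fixes $s$. The function names and the example mean and standard deviation are hypothetical.

```python
import math
import random

def squared_gamma_params(mean, sd):
    """Shape k and scale s such that X**2, with X ~ Gamma(k, s), has this mean and sd."""
    c = (sd / mean) ** 2                       # squared coefficient of variation of X**2
    # CV^2 of X**2 equals (4k + 6) / (k (k + 1)); solve c k^2 + (c - 4) k - 6 = 0, k > 0.
    k = ((4.0 - c) + math.sqrt((c - 4.0) ** 2 + 24.0 * c)) / (2.0 * c)
    s = math.sqrt(mean / (k * (k + 1.0)))
    return k, s

def squared_gamma_moments(k, s):
    """Mean and standard deviation of X**2 for X ~ Gamma(shape k, scale s)."""
    m2 = s ** 2 * k * (k + 1.0)
    m4 = s ** 4 * k * (k + 1.0) * (k + 2.0) * (k + 3.0)
    return m2, math.sqrt(m4 - m2 ** 2)

k, s = squared_gamma_params(mean=10.5, sd=40.0)   # hypothetical cell mean and sd
check_mean, check_sd = squared_gamma_moments(k, s)

rng = random.Random(0)
pseudo_values = [math.ceil(rng.gammavariate(k, s) ** 2) for _ in range(5)]
```

Rounding upwards, as described in the text, shifts the realized mean slightly above the target, so the match is exact only before rounding.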

A square gamma distribution was used because the 
gamma distribution is insufficiently right skewed. Even so, 
in almost all cells the largest value in the original population 
exceeded the largest value in the pseudo population. Of 
course without the original data, we cannot distinguish 
between right skew and a tendency to produce outliers. 

For each sample size, 10,000 Monte Carlo samples were taken. The results are shown in Table 7. For the poststratified estimator, 5 of the samples of size 200 had an empty poststratum and these runs were excluded from the results in Table 7.


Table 7
Empirical comparison of 4 estimators

                                  total employment                          total wages
estimator                  bias     st. err.      rt. MSE        bias     st. err.      rt. MSE
n = 200
unweighted                  644    1,006,956    1,006,956      -9,970    1,481,450    1,481,483
poststratified           -5,387    1,026,266    1,026,280      -2,149    1,548,833    1,548,834
cal. cell cts.         -224,198      942,164      968,472    -203,531    1,377,823  [illegible]
wtd. cal. cell cts.    -248,937      919,419      952,525    -232,558    1,326,234    1,346,469
n = 1,000
unweighted               -3,117      445,676      445,687       1,544      679,148      679,150
poststratified           -2,967      448,218      448,228       1,672      685,370      685,372
cal. cell cts.          -54,311      436,821      440,185     -44,942      665,799      667,314
wtd. cal. cell cts.     -63,327      432,396      437,008     -54,913      660,726      663,004
n = 5,000
unweighted                2,466      206,249      206,264      -2,539      304,852      304,863
poststratified            2,108      205,661      205,672      -2,705      304,751      304,763
cal. cell cts.           -8,265      204,693      204,859     -12,096      303,231      304,472
wtd. cal. cell cts.     -10,551      204,080      204,352     -14,697      302,311      302,668


Evidently the poststratification did not help. Even though 
no poststratum had an expected count below eight, the 
actual poststrata had quite variable sizes. In addition, the cell 
populations are quite skewed so that the poststrata sample 
means are quite variable. 

The other conclusions for the pseudo populations reflect the conclusions from the actual populations. In particular, when $n = 200$ and for the employment pseudo population, the weighted calibrated cell counts estimator $\hat t_{z,V_1}$ has an 11.8% gain in efficiency relative to the unbiased estimator. For the wages pseudo population and $n = 200$, the efficiency gain is 21.1%.


6. Concluding remarks 


The use in (3) of weight matrices $V(\beta)^{-1}$ which depend upon $\beta$ has not been explored in this paper. Experimentation with the use of such a matrix was not encouraging. Computation time increased dramatically, and there were significant numbers of cases which failed to numerically converge, with no improvement in efficiency over the fixed $V$ estimators considered here. Perhaps the authors did not try the right $V(\beta)$.

Besides the exponential back link function, the authors tried the logistic back link $f(\eta) = (1 + e^{\eta})^{-1}$. These runs also did not converge. On reflection, the reason is obvious: because in the simulations there were no nonresponse or noncoverage problems, the calibration weight adjustments $f(z' \hat\beta) \to 1$ as $n \to \infty$. But 1 is not in the range of $f$. It should be noted that in Chang and Kott (2008) a logistic back link was used to correct for nonresponse.

Several obvious issues arise. For example, how would the results of this study change if a more complicated sampling design than simple random sampling were used, or if nonresponse and/or noncoverage occurred and the calibration were used to correct for it? Falk (2010) considers these questions both theoretically and with further simulations using the QCEW population. Falk (2010) also considers nonlinear link functions.

There are obvious extensions to 3-way (and beyond)
cross classified data. If $I$, $J$, $K$ denote the number of cells
in each of the 3 classifications, there are $IJK$ fully
classified cells whose totals can be used for benchmark $x$
variables. There are $(IJ + IK + JK - I - J - K + 1)$ one-
way and two-way marginal variables that can be used for
model $z$ variables. Clearly, one might not want to use the
plethora of variables available.
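
The count of marginal variables quoted above can be checked with a short sketch. It assumes that the count refers to linearly independent indicators (a constant, the one-way margins and the two-way margins), which is our reading of the two-way case with $I + J - 1$ indicators; the function names are illustrative only.

```python
def marginal_variable_count(I, J, K):
    """Count linearly independent one-way and two-way marginal indicators
    (constant included) for an I x J x K cross classification."""
    constant = 1
    one_way = (I - 1) + (J - 1) + (K - 1)
    two_way = (I - 1) * (J - 1) + (I - 1) * (K - 1) + (J - 1) * (K - 1)
    return constant + one_way + two_way


def closed_form(I, J, K):
    """Closed-form expression quoted in the text."""
    return I * J + I * K + J * K - I - J - K + 1


if __name__ == "__main__":
    for dims in [(2, 2, 2), (3, 4, 5), (10, 6, 4)]:
        cells = dims[0] * dims[1] * dims[2]
        print(dims, "cells:", cells, "marginal variables:",
              marginal_variable_count(*dims))
```

Even for moderate table sizes the number of candidate $z$ variables grows quickly, which is the practical reason for selecting among them.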

In the context of linear calibration using the same x and 
z variables, several studies have been made on the choice 
of variables. Examples of such studies are Banker, Rathwell 
and Majkowski (1992), Silva and Skinner (1997), and Clark 
and Chambers (2008). The last paper remarks that too many
variables can increase the MSE of the calibrated estimator.

The alternatives to poststratification discussed here can 
be used in the presence of small and even empty cells. For 
example, in our simulations, the expected count in the state 
E, industry group 5, cell is 0.36 when n = 200. One might 
be tempted to collapse cells and use poststratification. 
Generally, however, it is not possible to do so and maintain 
the convenient doubly classified structure of the data. Our 
approaches, like poststratification, introduce weights for the 
purpose of sample balancing but avoid collapsing cells. 
These approaches generally increase bias but can offer 
substantial reductions in MSE. 

Furthermore, in the presence of nonresponse or non- 
coverage, the inverse of the weight adjustments can be 
considered, under a quasi-randomization model for the
response or coverage, as estimated probabilities of response 
and/or coverage. In our calibration approaches, these proba- 
bilities are assumed to be a function of the row and column 
classifications. When cells are collapsed without main- 
taining the double classification, these probabilities are 
harder to interpret. 


Appendix


Here we derive, using Chang and Kott (2008) equations
(16) and (17), sample based variance estimators for the 4
estimators studied in Section 2.

Let

$$H_y = \frac{\partial \hat{t}_y}{\partial \beta}(\hat{\beta}).$$

Here $\hat{t}_y(\beta)$ is defined in (2). $H_y$ is a row vector with one
entry for each $z$ variable. In our case, $H_y$ has $(I + J - 1)$
entries, one for each of the $I + J - 1$ linearly independent
indicator variables for the row and column classifications.

For the calibrated margins and calibrated cell counts
estimators, $f(\eta) = 1 + \eta$. Define the constants $s_{ij}$ and $t_{ij}$
by

$$s_{ij} = \frac{N}{n}\, n_{ij}, \qquad t_{ij} = \frac{N}{n} \sum_{k \in S_{ij}} y_k.$$

Then a simple calculation shows that if an entry exists in $H_y$
for the $i$th row classification, we place in that entry $\sum_j t_{ij}$.
Similarly, if an entry exists for the $j$th column classification,
we place in that entry $\sum_i t_{ij}$. Here we use the convention that
if the $i$th row or $j$th column is not one of the chosen
$I + J - 1$ linearly independent indicator variables then the
corresponding $\beta_{i\cdot}$ or $\beta_{\cdot j}$ is 0.

For the raking and exponential calibrated cell counts
estimators, $f(\eta) = e^{\eta}$ and we can similarly calculate $H_y$
using instead

$$s_{ij} = \frac{N}{n} \exp(\beta_{i\cdot} + \beta_{\cdot j})\, n_{ij}, \qquad t_{ij} = \frac{N}{n} \exp(\beta_{i\cdot} + \beta_{\cdot j}) \sum_{k \in S_{ij}} y_k.$$

Here we use the convention that if the $i$th row or $j$th column
is not one of the chosen $I + J - 1$ linearly independent
indicator variables then the corresponding $\exp(\beta_{i\cdot})$ or $\exp(\beta_{\cdot j})$ is 1.
Analogously to (2), let

$$\hat{t}_x(\beta) = \sum_{k \in S} w_k f(z_k'\beta)\, x_k.$$

$\hat{t}_x$ is a column vector with one entry for each $x$
variable. Define the $H$ matrix to be

$$H = \frac{\partial \hat{t}_x}{\partial \beta}(\hat{\beta}).$$

$H$ is a matrix with one row for each $x$ variable and one
column for each $z$ variable.

For the calibrated cell counts and exponential calibrated
cell counts estimators the matrix $H$ has dimensions
$IJ \times (I + J - 1)$. Each of the rows of $H$ corresponds to a
pair $(i, j)$ of row and column classifications. We place $s_{ij}$
in the row corresponding to $(i, j)$ and the columns
corresponding to the $i$th row classification and the $j$th
column classification (whenever these columns exist). All
other entries of $H$ are set to zero.

For the calibrated margins and raking estimators the
matrix $H$ has dimensions $(I + J - 1) \times (I + J - 1)$. If a
row (and hence a column) of $H$ exists for the $i$th row
classification we put $\sum_j s_{ij}$ in the corresponding diagonal
entry of $H$. Similarly, if a row and column exist for the $j$th
column classification, we put $\sum_i s_{ij}$ on the diagonal of $H$.
We place $s_{ij}$ in the entry whose row corresponds to the $i$th
row classification and whose column corresponds to the $j$th
column classification (whenever these exist). We also place
$s_{ij}$ in the entry whose column corresponds to the $i$th row
classification and whose row corresponds to the $j$th column
classification (again whenever these exist). All other entries
of $H$ are set to zero.
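
The filling rules for the calibrated margins and raking case can be sketched as follows. This is only an illustration under two assumptions that are not in the text: the $I + J - 1$ retained indicators are all $I$ row indicators followed by the first $J - 1$ column indicators (the last column is the one dropped), and the constants are supplied as an $I \times J$ nested list `s`. The function name is hypothetical.

```python
def build_margins_H(s):
    """Build the (I+J-1) x (I+J-1) matrix H from an I x J table s of
    constants s_ij, assuming the last column indicator is the one dropped."""
    I, J = len(s), len(s[0])
    dim = I + J - 1
    H = [[0.0] * dim for _ in range(dim)]
    # Diagonal entries for the row classifications: sum over j of s_ij.
    for i in range(I):
        H[i][i] = float(sum(s[i]))
    # Diagonal entries for the retained column classifications: sum over i.
    for j in range(J - 1):
        H[I + j][I + j] = float(sum(s[i][j] for i in range(I)))
    # Symmetric off-diagonal entries linking row i and column j.
    for i in range(I):
        for j in range(J - 1):
            H[i][I + j] = float(s[i][j])
            H[I + j][i] = float(s[i][j])
    return H


if __name__ == "__main__":
    for row in build_margins_H([[1, 2], [3, 4]]):
        print(row)
```

The resulting matrix is symmetric, with zeros between any two row indicators and between any two column indicators.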

Let $B = H_y (H' V^{-1} H)^{-1} H' V^{-1}$ where currently we
are using an identity matrix for $V$. $B$ has dimensions
$1 \times (I + J - 1)$ for the calibrated margins and raking
estimators and $1 \times IJ$ for the calibrated cell counts and the
exponential calibrated cell counts estimators. In the former
cases, we will denote the entries of $B$ by $b_{i\cdot}$ or $b_{\cdot j}$ and, for
the single case when a column or row index does not
correspond to one of the $I + J - 1$ independent indicator
variables, we will set the corresponding $b$ to zero. In the
latter cases, we will denote the entries of $B$ by $b_{ij}$. For
$k \in S_{ij}$, let $u_k = w_k (y_k - b_{i\cdot} - b_{\cdot j})$ for the calibrated
margins and raking estimators and $u_k = w_k (y_k - b_{ij})$ for the
calibrated cell counts and exponential calibrated cell counts
estimators.

Essentially Chang and Kott (2008) showed that,
asymptotically, the calibrated estimator has the same form
as a regression estimator of the form Särndal et al. (1992)
equation (6.6.1) where the above $B$ plays the role of $\hat{B}$ in
(6.6.1) and the sampling weights $d_k$ are replaced by
$w_k f(z_k'\hat{\beta})$. For without-replacement designs, they propose to
estimate the variance of $\hat{t}_y$ using the analogous changes
to Särndal et al. (1992) equation (6.6.3).

For simple random sampling, and in the absence of 
nonresponse or noncoverage, the variance estimator works 
out to 


2 
VS Tene (15) 
n 
where $s^2$ is the sample variance of the $u_k$.

In the presence of nonresponse, if one assumes that the
respondents $S$ are a Poisson sample from the original simple
random sample with Poisson probabilities $f(z_k'\beta)^{-1}$,
the variance estimator becomes


ed —,) ae 16) 
n a er | 


keS.j 


where $s^2$ is the sample variance of the $u_k$. The same
formula works for noncoverage where $f(z_k'\beta)^{-1}$ represents
the combined coverage and response probability in a
three stage model in which the covered universe is assumed
to be a Poisson sample from the desired universe, the
sample is a simple random sample from the covered
universe, and the respondents are a Poisson sample from the
original sample.


On variances of changes estimated
from rotating panels and dynamic strata 


Paul Knottnerus and Arnout van Delden ¹


Abstract 


Many business surveys provide estimates for the monthly turnover for the major Standard Industrial Classification codes. 
This includes estimates for the change in the level of the monthly turnover compared to 12 months ago. Because business 
surveys often use overlapping samples, the turnover estimates in consecutive months are correlated. This makes the variance 
calculations for a change less straightforward. This article describes a general variance estimation procedure. The procedure 
allows for yearly stratum corrections when establishments move into other strata according to their actual sizes. The 
procedure also takes into account sample refreshments, births and deaths. The paper concludes with an example of the 
variance for the estimated yearly growth rate of the monthly turnover of Dutch Supermarkets. 


Key Words: Births; Business surveys; Conditional covariances; Deaths; Overlapping samples; Stratum corrections. 


1. Introduction 


In many surveys a changing population is repeatedly 
sampled so that the level and the change in the level of a 
characteristic between two occasions can be estimated. For 
example, in many countries a monthly business survey is 
held to estimate the level of the monthly turnover and the 
change in that level compared to a month or a year ago; see 
Konschnik, Monsour and Detlefsen (1985). Another exam- 
ple is the labour force survey in which the population is 
sampled on a monthly basis to estimate the number of un- 
employed persons and the unemployment rate. Variance 
estimation is needed to judge whether the observed changes 
are statistically significant. Variance estimation is also 
needed in the design stage of the survey, to determine the 
optimal sample size and allocation or to determine the 
optimal estimator. 

In repeated surveys, changes are often estimated by 
using a stratification of the population. Businesses are 
extremely heterogeneous in terms of size and type of 
economic activity. Therefore, business surveys are usu- 
ally designed as a stratified simple random sample selected 
without replacement (STSRS); see Smith, Pont and Jones 
(2003). In surveys for households or individuals the sample 
is usually not stratified because households are less heter- 
ogeneous. Some social surveys, such as labour force sur- 
veys, however, use poststratification to reduce the variance 
and bias of the estimator. 

In deriving formulas for the variance of an estimated 
change in a population with dynamic strata, one has to pay 
attention to three complicating factors. Firstly, the change in 
a level is the result of two components. One component is 
due to the change in the population mean of units that 
remain in the same stratum on both occasions. The other 


component is caused by the change in the stratum compo- 
sition between two occasions resulting from births and 
deaths in the population and from population units that 
migrate between strata; see Holt and Skinner (1989). Sec- 
ondly, due to the migration of population units between 
strata, the estimated mean of stratum $h$ at occasion $t$ may
be correlated with the mean of stratum $\ell$ at occasion $t + 1$.
Thirdly, another complicating factor is that the population is 
repeatedly sampled, resulting in partially overlapping sam- 
ples between two occasions. Different rotating panel designs 
may be used in business surveys. 

Various authors have derived formulas for design-based 
variance estimators for the estimation of changes. Assuming 
a large population without births and deaths, Kish (1965) 
derived an expression for the variance of an estimated 
change based on overlapping samples. Tam (1984) removed 
the assumption of a large population. Elaborating on Tam’s 
results, Qualité and Tillé (2008) compare several variance 
estimators of an estimated change. Wood (2008) generalizes 
Tam’s results for surveys with unequal probabilities. 
Lowerre (1979) and Laniel (1987) deal with the variance 
estimation of a change in dynamic populations, but they do 
not take stratification into account. Hidiroglou, Sarndal and 
Binder (1995) deal with dynamic populations and strati- 
fication, but not with changing strata. Nordberg (2000) and 
Berger (2004) derived formulas for the more complicated 
situation of a dynamic population with units that move 
between strata. For the Swedish sampling design Nordberg 
(2000) derives formulas using inclusion indicators which 
requires some algebra. Assuming that the size of the overlap 
of two samples at two different occasions is fixed, Berger 
(2004) derives his formulas based on Poisson sampling 
conditional on the sample size per stratum which requires 
some matrix algebra. 

In this paper, we derive the expressions for STSRS sampling
in a more straightforward manner without assuming
that sizes of overlaps are fixed. Furthermore, unlike the 
Swedish design, the Dutch one doesn’t require time- 
consuming calculations for estimating one of the variance 
components for a change. In addition, we propose an alter- 
native estimation method for sampling designs with such a 
non-zero component. In order to clarify the variance esti- 
mation procedure, we describe its application to the yearly 
growth rates of the turnover of Dutch Supermarkets of 4- 
week periods. 

The outline of the paper is as follows. Section 2 briefly 
describes the Dutch business survey for monthly turnover, 
including the sampling design. The variance formulas for 
the estimator of a change are derived in section 3. Section 4 
illustrates the variance estimation procedure by comparing 
the variances of two different estimators for the yearly 
growth rate of the monthly turnover of Dutch Supermarkets 
in the period 2003-2004. Section 5 summarizes the main 
results and conclusions. 


2. The sampling design of the Dutch 
business surveys 


Every month Statistics Netherlands estimates the month- 
ly turnover for some of the major SIC codes. The publica- 
tion includes the 12-month growth rates of the monthly 
turnover, i.e., the relative change in the monthly level of 
turnover compared to 12 months ago. Throughout this paper 
we will refer to this growth rate as the yearly growth rate. 

All statistical units or establishments are listed in the 
General Business Register (GBR) that is maintained by 
Statistics Netherlands. The register is updated each month 
for births and deaths from administrative sources, while 
once a year, on December 31, the size category and the type 
of economic activity (SIC code) are updated. Note that the 
registration in the GBR may lag behind the changes in the 
population (births, deaths, size class changes, etc.).
Moreover, the (unknown) deaths in the frame may lead to a
biased estimate of the level of the turnover. In order to avoid 
this kind of bias, it is important to quickly detect and 
remove deaths from the frame. Deaths detected in the sam- 
ple may play a role here. However, a further analysis and 
correction of these errors are beyond the scope of this paper 
on variance estimation for growth rates. For estimating these 
variances, we assume that the population units and their 
characteristics in the register are correct. Likewise, we 
assume that there is zero non-response among the surveys. 

Every first day of the month an STSRS-like sample from 
the GBR is conducted to estimate the turnover of the current 
month. In fact, a rotating sample is used. The sample is 
stratified by size and by type of economic activity. The 
actual probability of selection depends on size and economic
activity. The probability of selection increases with
the size of establishment, with the largest establishments 
being included in the sample with probability 1. For some 
SICs there are not only survey data available but also data 
from administrative sources. The units already present in the 
administrative files are considered as a separate stratum. 
The estimates from this stratum have a zero variance. 

The sample is updated in two ways. Every month the 
sample is updated to correct for births and deaths in the pop- 
ulation. Once a year, in January, 10% of the sample units 
are replaced and stratum corrections are carried out. We will 
discuss the monthly and yearly updates in more detail. 


2.1 Monthly update 


Each month $t$ ($t = 1, 2, \ldots$) a fixed proportion $f_h$ of
the $N_h^t$ units in stratum $U_h^t$ is sampled ($h = 1, \ldots, H$).
This results in a sample $s_h^t$ of size $n_h^t = f_h N_h^t$. Hence, the
actual number of units in the sample may change from
month to month due to births and deaths in the population.
Note that apart from minor round-off errors the actual
sampling fraction $f_h$ does not depend on month $t$. In fact,
the update procedure for $s_h^t$ in month $t$ is as follows.
Define $U_{hB}^{t}$ as the set of births in stratum $h$ in month
$t - 1$ and denote its size by $N_{hB}^{t}$. The number of sampled
units from $U_{hB}^{t}$ in month $t$ is $n_{hB}^{t} = f_h N_{hB}^{t}$. In
addition, denote the further required difference $n_h^t - n_{hB}^{t}$
by $n_{h,REQ}^t$ and define $s_{h,PRP}^t$ by $s_{h,PRP}^t = s_h^{t-1} \cap U_h^t$, i.e.,
the set of units in $s_h^{t-1}$ that still exist in month $t$. Let $n_{h,PRP}^t$
denote the size of $s_{h,PRP}^t$. When $n_{h,PRP}^t > n_{h,REQ}^t$, randomly
drop the difference; otherwise select the difference from
$U_h^t \setminus (U_{hB}^{t} \cup s_{h,PRP}^t)$. Note that units dropped from the sample
in month $t - 1$ or earlier may be re-selected in month $t$.
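
The monthly update for a single stratum can be sketched as below. This is a minimal illustration, not the production procedure of Statistics Netherlands: units are represented by integer identifiers, sample sizes are obtained by rounding $f_h N$ (the rounding rule is an assumption), and the function name `monthly_update` and its arguments are hypothetical.

```python
import random


def monthly_update(prev_sample, population, births, f, rng):
    """One monthly update for a single stratum.

    prev_sample: set of unit ids in the sample of month t-1
    population:  set of unit ids in the stratum in month t (births included)
    births:      set of unit ids born in month t-1 (subset of population)
    f:           sampling fraction of the stratum
    rng:         random.Random instance used for all random selections
    """
    n_total = round(f * len(population))   # required sample size in month t
    n_births = round(f * len(births))      # units to draw among the births
    n_req = n_total - n_births             # required size among existing units
    kept = sorted((prev_sample & population) - births)  # survivors of old sample
    if len(kept) > n_req:
        # Too many survivors: randomly drop the difference.
        kept = rng.sample(kept, n_req)
    else:
        # Too few survivors: select the difference among existing units
        # that are neither births nor already in the sample.
        pool = sorted(population - births - set(kept))
        shortfall = min(n_req - len(kept), len(pool))
        kept = kept + rng.sample(pool, shortfall)
    sampled_births = rng.sample(sorted(births), n_births)
    return set(kept) | set(sampled_births)


if __name__ == "__main__":
    rng = random.Random(0)
    population = set(range(10, 110))   # units 0..9 have died
    births = set(range(100, 110))
    prev_sample = set(range(0, 40, 2))
    new_sample = monthly_update(prev_sample, population, births, 0.2, rng)
    print(len(new_sample), len(new_sample & births))
```

Because the pool excludes only the current survivors, a unit dropped in an earlier month can be drawn again, in line with the remark on re-selection above.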


2.2 Yearly update 


Each January, the sample is updated to account for both a 
re-stratification of the units and a sample replacement of 
10%. All sample units of December that still exist in Janu- 
ary are stratified according to their actual size, i.e., the 
number of employees and the SIC-code of January. The size 
class boundaries themselves remain unchanged. Conse- 
quently, the resulting sample from a stratum according to 
the new January stratification may consist of units with 
different inclusion probabilities because units move between 
strata with different sampling fractions. 

In order to correct for possibly different inclusion
probabilities in stratum $\ell$, denote the substratum consisting of
units that belonged to stratum $h$ in December and in January
to stratum $\ell$ by $U_{h\ell}^{dec,jan}$ and denote its size by $N_{h\ell}^{dec,jan}$
($h, \ell = 1, \ldots, H$). In analogy with the monthly update
procedure define $s_{h\ell,PRP}^{jan}$ by $s_{h\ell,PRP}^{jan} = s_h^{dec} \cap U_{h\ell}^{dec,jan}$ and let
$n_{h\ell,PRP}^{jan}$ denote the size of $s_{h\ell,PRP}^{jan}$. Since the required size of
sample $s_{h\ell}^{jan}$ from $U_{h\ell}^{dec,jan}$ in January is $n_{h\ell,REQ}^{jan} =
f_\ell N_{h\ell}^{dec,jan}$, the yearly update of sample $s_{h\ell,PRP}^{jan}$ is carried out
as follows.

Firstly, when $n_{h\ell,PRP}^{jan} > n_{h\ell,REQ}^{jan}$, randomly drop the
difference from $s_{h\ell,PRP}^{jan}$. In addition, 10% of the $n_{h\ell,REQ}^{jan}$
remaining units in $s_{h\ell,PRP}^{jan}$ is replaced by units from
$U_{h\ell}^{dec,jan} \setminus s_{h\ell,PRP}^{jan}$, provided that the latter set contains enough
units. When there are not enough units available, the
number of replaced units is only $N_{h\ell}^{dec,jan} - n_{h\ell,PRP}^{jan}$.
Secondly, when $n_{h\ell,PRP}^{jan} < n_{h\ell,REQ}^{jan}$, select the difference from
$U_{h\ell}^{dec,jan} \setminus s_{h\ell,PRP}^{jan}$. Subsequently, an additional replacement of
$n_{h\ell,PRP}^{jan} - 0.9\, n_{h\ell,REQ}^{jan}$ units in $s_{h\ell,PRP}^{jan}$ takes place when this
difference is positive and enough new units are available.
This procedure is done for all substrata $h\ell$, including
$h = \ell$. Thirdly, similar to the monthly update procedure, the
number of sampled units in January from the substratum
$U_{\ell B}^{jan}$ of new births in stratum $\ell$ is $n_{\ell B}^{jan} = f_\ell N_{\ell B}^{jan}$. In
addition, note that this approach can also be followed when
size class boundaries or sampling fractions are changed in
January.
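
The bookkeeping of the yearly update for one substratum can be sketched in terms of counts only. The sketch rests on several assumptions that go beyond the text: required sizes are obtained by rounding, the additional replacement in the second case equals the preliminary size minus 90% of the required size (one reading of the procedure), and new units are limited to those outside the preliminary sample. All names are hypothetical.

```python
def yearly_update_counts(n_prp, N_sub, f, replace_rate=0.10):
    """Counts of dropped, newly selected and replaced units in one substratum.

    n_prp: size of the preliminary sample (December units still present)
    N_sub: population size of the substratum in January
    f:     sampling fraction of the January stratum
    """
    n_req = round(f * N_sub)       # required January sample size
    available = N_sub - n_prp      # units outside the preliminary sample
    if n_prp > n_req:
        dropped = n_prp - n_req
        selected = 0
        # Replace 10% of the remaining units, as far as new units exist.
        replaced = min(round(replace_rate * n_req), available)
    else:
        dropped = 0
        selected = n_req - n_prp
        # Assumed reading: replace old units in excess of 90% of n_req.
        excess_old = n_prp - round((1 - replace_rate) * n_req)
        replaced = max(0, min(excess_old, available - selected))
    return {"required": n_req, "dropped": dropped,
            "selected": selected, "replaced": replaced}


if __name__ == "__main__":
    print(yearly_update_counts(n_prp=30, N_sub=200, f=0.1))
    print(yearly_update_counts(n_prp=19, N_sub=200, f=0.1))
    print(yearly_update_counts(n_prp=50, N_sub=51, f=0.9))
```

In the last call only one unit lies outside the preliminary sample, so the replacement is capped at one unit, which mirrors the "not enough units available" case.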

Apart from the stratum corrections in January, the 
resulting sample in month $t$ can be considered more or less
as a set of SRS samples from the strata $U_h^t$. When the
population and the strata $h$ are stable over the years, the
procedure described so far amounts to a standard STSRS
sampling design for month $t$. Therefore, Statistics Netherlands
uses the familiar variance formulas for the STSRS
sampling design for estimating the variance of the level of 
the monthly turnover. In the next section we show how the 
variance for a change of the level can be estimated under 
such an STSRS assumption. 


3. Variance of the yearly growth rate 
of monthly turnover 


3.1 Variance of the yearly growth rate 


Let $O^t$ denote the total turnover of all establishments in
the population in month $t$ and $g^{t,s}$ the relative change in
the level of turnover between months $t$ and $s$, i.e.,

$$g^{t,s} = \frac{O^t}{O^s} - 1.$$

For the corresponding estimates it holds by definition that

$$\hat{g}^{t,s} = \frac{\hat{O}^t}{\hat{O}^s} - 1, \tag{1}$$

where a "hat" indicates an estimate; for an estimator we use
the same notation. Furthermore, define

$$G^{t,t-12} = \frac{O^t}{O^{t-12}}.$$
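
In order to estimate the variance of the yearly growth rate
of the monthly turnover, we use the first-order Taylor series
expansion of a ratio of two estimators. That is,

$$\operatorname{var}(\hat{g}^{t,t-12}) = \operatorname{var}\!\left(\frac{\hat{O}^t}{\hat{O}^{t-12}}\right)
\approx \frac{\operatorname{var}(\hat{O}^t - G^{t,t-12}\hat{O}^{t-12})}{(O^{t-12})^2}
= \frac{\operatorname{var}(\hat{O}^t) + (G^{t,t-12})^2 \operatorname{var}(\hat{O}^{t-12}) - 2G^{t,t-12}\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^t)}{(O^{t-12})^2}. \tag{2}$$

The major problem is the estimation of $\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^t)$. In
the next sections we examine this term and its estimation.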
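
A small numerical sketch of the linearized variance may help to see the role of the covariance term. The function below simply evaluates the first-order Taylor expression for a ratio, with the unknown totals replaced by their estimates; the function name and the input figures are made up for illustration.

```python
def growth_rate_variance(O_t, O_t12, var_t, var_t12, cov):
    """First-order Taylor (linearization) variance of O_t / O_t12 - 1.

    O_t, O_t12:     estimated totals in months t and t-12
    var_t, var_t12: variances of the two estimated totals
    cov:            covariance between the two estimated totals
    """
    G = O_t / O_t12
    numerator = var_t + G ** 2 * var_t12 - 2.0 * G * cov
    return numerator / O_t12 ** 2


if __name__ == "__main__":
    # Hypothetical figures: a strong positive covariance lowers the variance.
    print(growth_rate_variance(110.0, 100.0, 25.0, 16.0, 15.0))
    print(growth_rate_variance(110.0, 100.0, 25.0, 16.0, 0.0))
```

With overlapping samples the covariance is typically positive, so ignoring it would overstate the variance of the growth rate.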


3.2 The covariance term of the yearly growth rate


Using the stratified sampling design, we can write
$\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^t)$ from (2) as

$$\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^t)
= \operatorname{cov}\!\left(\sum_{h=1}^{H} N_h^{t-12}\, \bar{o}_h^{t-12},\ \sum_{\ell=1}^{H} N_\ell^{t}\, \bar{o}_\ell^{t}\right)
= \sum_{h=1}^{H}\sum_{\ell=1}^{H} N_h^{t-12} N_\ell^{t} \operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}), \tag{3}$$

where $\bar{o}_h^{t-m}$ stands for the sample mean of the turnover in
stratum $h$ in month $t - m$ ($m = 0, 12$). Note that the
stratification of the units in month $t - 12$ may differ from
that in month $t$. As we have seen in section 2.2, the standard
refreshment of the panel takes place in January. Furthermore,
each establishment is allocated to the correct stratum
$h$ according to its actual number of employees in January
($h = 1, \ldots, H$). To take these design features into account,
define

$N_{h\ell}^{t-12,t}$: size of substratum $U_{h\ell}^{t-12,t}$, i.e., the set of units that
in month $t - 12$ belonged to stratum $h$ and in
month $t$ to stratum $\ell$ ($h, \ell = 1, \ldots, H$);

$O_{h\ell}^{t-m}$: the substratum population total of the turnover in
$U_{h\ell}^{t-12,t}$ in month $t - m$ ($m = 0, 12$);

$\bar{O}_{h\ell}^{t-m}$: the substratum population mean of the turnover in
$U_{h\ell}^{t-12,t}$ in month $t - m$ [i.e., $\bar{O}_{h\ell}^{t-m} = O_{h\ell}^{t-m} / N_{h\ell}^{t-12,t}$
($m = 0, 12$)];

$n_{h\ell}^{t-m}$: size of sample $s_{h\ell}^{t-m}$, i.e., the actual sample from
$U_{h\ell}^{t-12,t}$ in month $t - m$ ($0 \le m \le 12$);

$o_{h\ell}^{t-m}$: the sample total of the turnover in $s_{h\ell}^{t-m}$ ($m = 0, 12$);

$\bar{o}_{h\ell}^{t-m}$: the sample mean of the turnover in $s_{h\ell}^{t-m}$ [i.e.,
$\bar{o}_{h\ell}^{t-m} = o_{h\ell}^{t-m} / n_{h\ell}^{t-m}$ ($m = 0, 12$)];

$n_{h\ell}^{t-12,t}$: number of units in the overlap $s_{h\ell}^{t-12,t} = s_{h\ell}^{t-12} \cap s_{h\ell}^{t}$;

$\bar{o}_{h\ell,OLP}^{t-m}$: the sample mean of the turnover in the overlap
$s_{h\ell}^{t-12,t}$ in month $t - m$ ($m = 0, 12$).

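In addition to the notation in section 2, define the auxiliary
stratum 0 for the births in months $t - 12, \ldots, t - 1$ and
likewise, stratum $H + 1$ for the deaths in that period. Then
$\bar{o}_h^{t-12}$ and $\bar{o}_\ell^{t}$ can be written as

$$\bar{o}_h^{t-12} = \sum_{g=1}^{H+1} \frac{n_{hg}^{t-12}}{n_h^{t-12}}\, \bar{o}_{hg}^{t-12}, \qquad
\bar{o}_\ell^{t} = \sum_{k=0}^{H} \frac{n_{k\ell}^{t}}{n_\ell^{t}}\, \bar{o}_{k\ell}^{t},$$

respectively ($1 \le h, \ell \le H$). Consequently, the covariances
in (3) can be rewritten as

$$\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t})
= \operatorname{cov}\!\left(\sum_{g=1}^{H+1} \frac{n_{hg}^{t-12}}{n_h^{t-12}}\, \bar{o}_{hg}^{t-12},\ \sum_{k=0}^{H} \frac{n_{k\ell}^{t}}{n_\ell^{t}}\, \bar{o}_{k\ell}^{t}\right) \tag{4a}$$

$$= \frac{1}{n_h^{t-12} n_\ell^{t}} \operatorname{cov}(n_{h\ell}^{t-12}\, \bar{o}_{h\ell}^{t-12},\ n_{h\ell}^{t}\, \bar{o}_{h\ell}^{t}), \tag{4b}$$

where we used $\operatorname{cov}(n_{hg}^{t-12} \bar{o}_{hg}^{t-12}, n_{k\ell}^{t} \bar{o}_{k\ell}^{t}) = 0$ ($k \ne h$) and
$\operatorname{cov}(n_{hg}^{t-12} \bar{o}_{hg}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t}) = 0$ ($g \ne \ell$). The latter covariance
is zero because

$$\operatorname{cov}(n_{hg}^{t-12} \bar{o}_{hg}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t})
= E\{\operatorname{cov}(n_{hg}^{t-12} \bar{o}_{hg}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t} \mid n_{hg}^{t-12}, n_{h\ell}^{t})\}
+ \operatorname{cov}\{E(n_{hg}^{t-12} \bar{o}_{hg}^{t-12} \mid n_{hg}^{t-12}, n_{h\ell}^{t}),\ E(n_{h\ell}^{t} \bar{o}_{h\ell}^{t} \mid n_{hg}^{t-12}, n_{h\ell}^{t})\}$$

$$= 0 + \bar{O}_{hg}^{t-12}\, \bar{O}_{h\ell}^{t} \operatorname{cov}(n_{hg}^{t-12}, n_{h\ell}^{t}) = 0.$$

In the last line we also used that for $1 \le g \le H + 1$

$$\operatorname{cov}(n_{hg}^{t-12}, n_{h\ell}^{t}) = 0. \tag{5}$$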

For a justification and the underlying assumptions of (5), see 
Appendix A. Moreover, in Appendix A we propose an 
alternative estimation method when this covariance is non- 
negligible. The covariance in (4b) can be expressed as 


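$$\operatorname{cov}(n_{h\ell}^{t-12} \bar{o}_{h\ell}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t})
= E\{\operatorname{cov}(n_{h\ell}^{t-12} \bar{o}_{h\ell}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t} \mid \nu_{h\ell})\}
+ \operatorname{cov}\{E(n_{h\ell}^{t-12} \bar{o}_{h\ell}^{t-12} \mid \nu_{h\ell}),\ E(n_{h\ell}^{t} \bar{o}_{h\ell}^{t} \mid \nu_{h\ell})\}, \tag{6}$$

where $\nu_{h\ell} = (n_{h\ell}^{t-12}, n_{h\ell}^{t}, n_{h\ell}^{t-12,t})$. The first component on the
right-hand side is

$$E\{\operatorname{cov}(n_{h\ell}^{t-12} \bar{o}_{h\ell}^{t-12}, n_{h\ell}^{t} \bar{o}_{h\ell}^{t} \mid \nu_{h\ell})\}
= E\{n_{h\ell}^{t-12} n_{h\ell}^{t} \operatorname{cov}(\bar{o}_{h\ell}^{t-12}, \bar{o}_{h\ell}^{t} \mid \nu_{h\ell})\}
= E\left\{n_{h\ell}^{t-12} n_{h\ell}^{t} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right)\right\} S_{h\ell}^{t-12,t}. \tag{7}$$

In the last line we used (26) in Appendix B. Furthermore,

$$S_{h\ell}^{t-12,t} = \frac{1}{N_{h\ell}^{t-12,t} - 1} \sum_{i=1}^{N_{h\ell}^{t-12,t}} (O_{h\ell i}^{t-12} - \bar{O}_{h\ell}^{t-12})(O_{h\ell i}^{t} - \bar{O}_{h\ell}^{t}). \tag{8}$$

The second component on the right-hand side of (6) is equal
to $\bar{O}_{h\ell}^{t-12}\, \bar{O}_{h\ell}^{t} \operatorname{cov}(n_{h\ell}^{t-12}, n_{h\ell}^{t}) = 0$ on account of (5). It
therefore follows from (4) and (6) that

$$\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t})
= \frac{1}{n_h^{t-12} n_\ell^{t}}\, E\left\{n_{h\ell}^{t-12} n_{h\ell}^{t} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right)\right\} S_{h\ell}^{t-12,t}. \tag{9}$$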


3.3 Estimation of the covariance term of the yearly 
growth rate 


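Expression (9) can be estimated from the overlapping
sample $s_{h\ell}^{t-12,t}$ by

$$\widehat{\operatorname{cov}}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t})
= \frac{n_{h\ell}^{t-12} n_{h\ell}^{t}}{n_h^{t-12} n_\ell^{t}} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right) \hat{S}_{h\ell,OLP}^{t-12,t}, \tag{10}$$

where

$$\hat{S}_{h\ell,OLP}^{t-12,t} = \frac{1}{n_{h\ell}^{t-12,t} - 1} \sum_{i=1}^{n_{h\ell}^{t-12,t}} (o_{h\ell i}^{t-12} - \bar{o}_{h\ell,OLP}^{t-12})(o_{h\ell i}^{t} - \bar{o}_{h\ell,OLP}^{t}).$$

Note that (10) is unbiased for estimating (9) because
$E(\hat{S}_{h\ell,OLP}^{t-12,t} \mid \nu_{h\ell}) = S_{h\ell}^{t-12,t}$.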

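Although (10) results in reasonable estimates for sufficiently
large $n_{h\ell}^{t-12,t}$, a disadvantage of the covariance estimator
$\hat{S}_{h\ell,OLP}^{t-12,t}$ in (10) is that for small $n_{h\ell}^{t-12,t}$ it may lead to a
negative estimate of $\operatorname{var}(\hat{O}^t - G^{t,t-12}\hat{O}^{t-12})$ in the numerator
of (2). Recall that this variance is estimated by

$$\widehat{\operatorname{var}}(\hat{O}^t - \hat{G}^{t,t-12}\hat{O}^{t-12}) = \widehat{\operatorname{var}}(\hat{O}^t) + (\hat{G}^{t,t-12})^2\, \widehat{\operatorname{var}}(\hat{O}^{t-12})
- 2\hat{G}^{t,t-12}\, \widehat{\operatorname{cov}}(\hat{O}^{t-12}, \hat{O}^t). \tag{11}$$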

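Therefore, we propose an alternative estimator to $\hat{S}_{h\ell,OLP}^{t-12,t}$ in
(10). Define the standard sample variances

$$(\hat{S}_{h\ell}^{t-m})^2 = \frac{1}{n_{h\ell}^{t-m} - 1} \sum_{i=1}^{n_{h\ell}^{t-m}} (o_{h\ell i}^{t-m} - \bar{o}_{h\ell}^{t-m})^2 \qquad (m = 0, 12).$$

We propose the following modified estimator for $S_{h\ell}^{t-12,t}$:

$$\hat{S}_{h\ell}^{t-12,t} = \hat{\rho}_{h\ell,OLP}^{t-12,t}\, \hat{S}_{h\ell}^{t-12}\, \hat{S}_{h\ell}^{t}, \tag{12}$$

where $\rho_{h\ell}^{t-12,t}$ is the correlation between the variables $o^t$ and
$o^{t-12}$ in $U_{h\ell}^{t-12,t}$, and $\hat{\rho}_{h\ell,OLP}^{t-12,t}$ is its estimate from $s_{h\ell}^{t-12,t}$.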

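According to (10) and (12) covariance (3) can be estimated by

$$\widehat{\operatorname{cov}}(\hat{O}^{t-12}, \hat{O}^t) = \sum_{h=1}^{H}\sum_{\ell=1}^{H} \frac{N_h^{t-12} N_\ell^{t}}{n_h^{t-12} n_\ell^{t}}\, n_{h\ell}^{t-12} n_{h\ell}^{t} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right) \hat{S}_{h\ell}^{t-12,t}. \tag{13}$$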

For the estimate $\hat{\rho}_{h\ell,OLP}^{t-12,t}$, $|\hat{\rho}_{h\ell,OLP}^{t-12,t}| \le 1$ always holds
whereas using (10) may lead implicitly to an estimated
correlation larger than 1 and a possibly negative outcome of
(11). See the next section for an example. In all applications
met so far, negative outcomes of (11) could be explained by
the fact that unlike (12) use of (10) leads implicitly to an
estimated correlation larger than 1. This is in line with the
findings of Berger (2004, page 462) that an overestimation
of the correlation between $\hat{O}^{t-12}$ and $\hat{O}^t$ may lead to a
serious underestimation of the variance of a change.
Nevertheless, in some extraordinary circumstances, the use of (12)
might lead to a negative outcome of (11) as well. Sufficient
conditions that the use of (12) leads to a nonnegative
variance estimator with probability 1 are available from the
authors upon request. For a general review of variance
estimation methods in business surveys, see Brodie (2003).
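
The idea behind the modified estimator can be sketched as follows: the correlation is taken from the overlap only, while the two standard deviations come from the full substratum samples of each month. The function names and the toy data are illustrative assumptions; at least two observations per sample are assumed.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def overlap_correlation(o_prev, o_curr):
    """Correlation between months t-12 and t estimated from overlap units."""
    n = len(o_prev)
    mean_prev = sum(o_prev) / n
    mean_curr = sum(o_curr) / n
    cov = sum((a - mean_prev) * (b - mean_curr)
              for a, b in zip(o_prev, o_curr)) / (n - 1)
    sd_prev, sd_curr = sample_sd(o_prev), sample_sd(o_curr)
    if sd_prev == 0.0 or sd_curr == 0.0:
        return 0.0  # convention chosen here for degenerate overlaps
    return cov / (sd_prev * sd_curr)


def modified_covariance(overlap_prev, overlap_curr, sample_prev, sample_curr):
    """Overlap correlation times the full-sample standard deviations."""
    rho = overlap_correlation(overlap_prev, overlap_curr)
    return rho * sample_sd(sample_prev) * sample_sd(sample_curr)


if __name__ == "__main__":
    print(modified_covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                              [1.0, 2.0, 3.0, 4.0, 5.0],
                              [2.0, 4.0, 6.0, 8.0, 10.0]))
```

Since the estimated correlation cannot exceed 1 in absolute value, the implied covariance is bounded by the product of the two estimated standard deviations, which is what protects the variance of the change from turning negative in most cases.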

Applying (12), a special problem may arise when
$n_{h\ell}^{t-12} = 1$ or $n_{h\ell}^{t} = 1$. In order to evaluate the required
sample variances, one may borrow the sample variance from a
related substratum or from the same substratum in an earlier
month. Alternatively, one may impute a variance when it
emerges from the data that there is a proportionality
relationship between $S_{h\ell}$ and $\bar{O}_{h\ell}$; see Särndal, Swensson and
Wretman (1992, page 461). In addition, the corresponding
covariance term might be ignored when its (expected)
contribution to the total variance is small. This is often the case
when the sampling fractions in strata $h$ and $\ell$ are small, that is, in
strata with relatively small units and, consequently, with
small variances compared to the strata with larger units.
Similar remarks apply to the imputed $\hat{\rho}_{h\ell,OLP}^{t-12,t}$ when
$n_{h\ell}^{t-12,t} \le 2$ and $n_{h\ell}^{t-m} > 2$ ($m = 0, 12$). Since the $\hat{\rho}_{h\ell,OLP}^{t-12,t}$
are often fairly high, this seems to be a viable way. In the
example given in section 4 the $\hat{\rho}_{h\ell,OLP}^{t-12,t}$ have an overall mean
of 0.90 and a variance of 0.0074 so that the impact of the
imputed $\hat{\rho}_{h\ell,OLP}^{t-12,t}$ on the final results is likely to be moderate.

Furthermore, note that when $n_{h\ell}^{t-m} = 0$ ($m = 0$ or $m = 12$),
the corresponding covariance term in (13) can be
neglected without affecting its unbiasedness, provided that
the remaining $S_{h\ell}^{t-12,t}$ are estimated in an unbiased way.
Under this assumption such a term with $n_{h\ell}^{t-m} = 0$ ($m = 0$
or $m = 12$) can be neglected because the expectation of

$$n_{h\ell}^{t-12} n_{h\ell}^{t} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right) \hat{S}_{h\ell}^{t-12,t} \tag{14}$$

from (13) is equal to

$$E\left\{n_{h\ell}^{t-12} n_{h\ell}^{t} \left(\frac{n_{h\ell}^{t-12,t}}{n_{h\ell}^{t-12} n_{h\ell}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}}\right) E(\hat{S}_{h\ell}^{t-12,t} \mid \nu_{h\ell})\right\}$$

and the expectation on the right-hand side is the parameter
to be estimated. Moreover, when $n_{h\ell}^{t-m} = 0$ ($m = 0$ or
$m = 12$) and consequently $n_{h\ell}^{t-12,t} = 0$, the outcome of (14)
is zero and the estimator $\hat{S}_{h\ell}^{t-12,t}$ for $S_{h\ell}^{t-12,t}$ becomes irrelevant.
Therefore, ignoring such a term when $n_{h\ell}^{t-m} = 0$
($m = 0$ or $m = 12$) does not affect the expectations of (13)
and (14).


3.4 A comparison with Nordberg’s results 


Using the standard formalism of inclusion indicators $\delta_{hi}$
for each stratum, Nordberg (2000) derives a different ex- 
pression for the first component in (6). However, it can be 
shown after some algebra that our expression (9) is equiva- 
lent to Nordberg’s (3.4); a proof is available from the 
authors upon request. In addition, Nordberg derives a non- 
zero expression for the second component in (6), i.e., the 
covariance between the two corresponding conditional 
expectations. Note that the Swedish sampling design is 
somewhat different from ours. 

According to Nordberg (2000, page 370) the estimation 
of the second component for the Swedish sampling design 
requires a computer-intensive procedure which includes 
simulation of the sampling mechanism. However, since all
$n_{h\ell}^{t-12}$, $n_{h\ell}^{t}$ and $n_{h\ell}^{t-12,t}$ are ancillary statistics, an alternative
might be to condition on these statistics so that the second
component can be ignored. Recall that a statistic is called
ancillary when its marginal distribution doesn't depend on
the target parameters to be estimated; see Cox and Hinkley
(1974, pages 31-35). Such an alternative approach without
the second component is to be recommended especially
when $\hat{O}^t_{STSRS} \approx \hat{O}^t_{PS}$, where $\hat{O}^t_{PS}$ is the poststratified
estimator based on the substrata $U_{h\ell}^{t-12,t}$. However, when the
difference between $\hat{O}^t_{STSRS}$ and $\hat{O}^t_{PS}$ is non-negligible, the
calculation of the unconditional variance seems to be
indispensable, including the estimation of the second
component according to Nordberg. For a different approach
to the estimation problem of the second component, see
Appendix A.

For a justification of the use of a conditional (co)variance,
see Holt and Smith (1979). An important advantage of
the conditional (co)variance is that the corresponding
confidence interval has better coverage properties than the one
based on the unconditional variance. Denote the standard
conditional 95% confidence interval for an arbitrary
parameter $\theta$ by $(\hat{\theta}_L, \hat{\theta}_U \mid \nu)$ where $\nu$ denotes the vector
consisting of all (ancillary) statistics involved in the conditional
(co)variances. Then under the normality assumption and
some mild conditions it holds that the actual 95%
confidence level (CL) equals the nominal confidence level
because

$$CL = \sum_{\nu \in \Omega_\nu} P(\hat{\theta}_L < \theta < \hat{\theta}_U \mid \nu)\, P(\nu) = 0.95 \sum_{\nu \in \Omega_\nu} P(\nu) = 0.95,$$

where $\Omega_\nu$ stands for the set of all possible outcomes of the
random vector $\nu$. When unconditional (co)variances are
used, the confidence intervals thus obtained may be quite
inaccurate for a given sample allocation. Moreover, when
averaged over all allocations CL may differ from 0.95; for
an example, see Knottnerus (2003, pages 133-135). Note
that in the planning stage before the sample is drawn,
unconditional variances are always useful for examining
Kish's design effect for a comparison of different sampling
designs. In addition, note that for evaluating a conditional
confidence interval for $\hat{g}^{t,t-12}$ the underlying variances of
$\hat{O}^{t-m}$ should also be taken conditional on the $\nu_{h\ell}$
($m = 0, 12$).

Finally, the unbiased estimator proposed by Nordberg
[2000, Equation (3.9)] for the first component in (6) is quite
different from those described in the previous subsection. In
fact, his estimator is based on the following procedure for
estimating the covariance term $S_{h\ell}^{t-12,t}$. Firstly, estimate the
underlying quantity $\sum_i O_{h\ell i}^{t-12} O_{h\ell i}^{t}$ from the overlap
$s_{h\ell}^{t-12,t}$. Secondly, estimate the corresponding turnover
means from $s_{h\ell}^{t-12}$ and $s_{h\ell}^{t}$, respectively. Since the
components thus estimated stem from different samples, a negative
outcome of (11) cannot always be avoided. For a small
example with real data, see the following section. In the
remainder Nordberg's underlying estimator for $S_{h\ell}^{t-12,t}$ is
denoted by $\hat{S}_{h\ell,NRD}^{t-12,t}$. A derivation of the explicit expression
for $\hat{S}_{h\ell,NRD}^{t-12,t}$ is available from the authors upon request.




4. An application to the change of turnover 
in Dutch Supermarkets 


4.1 Two estimators for the yearly change of 
turnover 


For the impact on the variance estimators it is important
to know that in January the turnover is estimated twice. The
first estimate, denoted by $\hat{O}^{jan}_{O}$ (with O for old), is made
before the yearly sample update and is used to estimate the
monthly change of the turnover in January compared to that
in December. The second estimate, denoted by $\hat{O}^{jan}_{N}$ (with
N for new), is made after the yearly sample update and is
used to estimate the monthly change of the turnover in
February compared to January. This procedure implies that
units of the old sample as well as those of the new sample
receive a questionnaire in January.

Unlike estimator (1) the actual estimator used by 
Statistics Netherlands for the yearly change in the monthly 
turnover is based on a chain of 12 monthly changes in 
turnover 
oy O30
ie las = 6 ian (¢ # jan). (15) 


In this section we will compare the variances of estimators
(1) and (15). Similar to (2) the variance formulas for
$\hat{g}^{t,t-12}_{act}$ can be derived by a first-order Taylor series
expansion.


4.2 Description of the data 


The calculations for the variances and confidence intervals
in this example are based on turnover data of Dutch
Supermarkets of 4-week periods in 2003 and 2004 (i.e.,
$t = 1, \ldots, 26$). Hence, there are 13 observations in one year
and, consequently, we use slightly adjusted symbols such as
$\hat{g}^{t,t-13}$ in the remainder of this section.

The population consists of about 3,500 establishments. 
The turnover data stem from a stratified sample and admin- 
istrative files. A gross STSRS sample of about 900 units 
stratified by size is drawn from the full list of population 
units of the GBR that includes the units of the administrative 
files as well. Establishments with 50 or more employees are 
included with probability 1. The other establishments are 
sampled with decreasing inclusion probability from 1:2 (20- 
49 employees per establishment) to 1:40 in the smallest size 
(1 employee per establishment). The administrative files 
contain about 950 units, present in all size classes. About 
500 of the 900 units in the gross sample were already 
present in the administrative files, but they do not receive a 
questionnaire. Thus, the net sample contains about 400 
units. In fact, the sample size for each stratum in this
specific example is random. However, as explained in
subsection 3.4, we estimate all (co)variances conditional on the
$n_h$ in such a case. Data from units within the administrative
files are put into a separate stratum with the sampling
fraction being unity.


4.3 Results 


Table 1 gives the yearly growth rates and their 95%
margins for $t = 16, \ldots, 24$. It emerges that the 95% margins
for the estimated growth rates $\hat{g}^{t,t-13}_{act}$, currently used by
Statistics Netherlands, vary between 0.8 and 1.0 (percentage
point). For example, in the first period ($t = 16$) the 95%
confidence interval for the yearly growth rate is -1.3 to 0.7
per cent. As expected, the 95% margins for the more
complicated estimator $\hat{g}^{t,t-13}_{act}$ are close to those for the
simpler $\hat{g}^{t,t-13}$ from (1). The 95% margins of $\hat{g}^{t,t-13}$ vary
between 0.7 and 1.0 (percentage point). The estimator for the
growth rate to be preferred is $\hat{g}^{t,t-13}_{act}$ as it corrects for the
yearly sample update in January. The estimation of its
variance, however, can be simplified by using the variance
estimator described in section 3 rather than the more laborious
expression for $\mathrm{var}(\hat{g}^{t,t-13}_{act})$.


Table 1
Estimated growth rates with 95% margins

 t    $\hat{g}^{t,t-13}_{act}$ × 100%    $\hat{g}^{t,t-13}$ × 100%
16    -0.3 (± 1.0)¹                      -0.4 (± 1.0)
17    -3.7 (± 1.0)                       -3.8 (± 0.9)
18     1.6 (± 1.0)                        1.5 (± 0.9)
19    -2.2 (± 0.9)                       -2.3 (± 0.9)
20     0.5 (± 0.8)                        0.4 (± 0.7)
21    -1.7 (± 0.8)                       -1.8 (± 0.7)
22    -2.2 (± 0.8)                       -2.3 (± 0.7)
23     0.0 (± 0.8)                       -0.1 (± 0.7)
24    -2.3 (± 0.9)                       -2.4 (± 0.9)

¹The 95% margins are given between parentheses.
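
The step from a growth rate and its 95% margin to a confidence interval can be sketched in a few lines. This is only an illustration of the arithmetic behind the interval quoted for period 16; the function names are hypothetical and the values are those of the first estimator column of Table 1.

```python
def interval(estimate, margin):
    """Symmetric interval (lower, upper) around an estimate, rounded to one decimal."""
    return (round(estimate - margin, 1), round(estimate + margin, 1))


def differs_from_zero(estimate, margin):
    """True when the interval estimate +/- margin excludes zero."""
    lower, upper = interval(estimate, margin)
    return lower > 0 or upper < 0


# Growth rates (%) and 95% margins (percentage points) from the first
# estimator column of Table 1, keyed by period t.
table1 = {16: (-0.3, 1.0), 17: (-3.7, 1.0), 18: (1.6, 1.0), 19: (-2.2, 0.9),
          20: (0.5, 0.8), 21: (-1.7, 0.8), 22: (-2.2, 0.8), 23: (0.0, 0.8),
          24: (-2.3, 0.9)}

for t, (growth, margin) in table1.items():
    print(t, interval(growth, margin), differs_from_zero(growth, margin))
```

For period 16 the sketch returns the interval from -1.3 to 0.7 per cent mentioned in the text.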


As described in section 3, we have used the estimated
correlation $\hat{\rho}_{OLP}$ from the overlap to estimate the
covariance $S^{t-13,t}$ in order to avoid negative outcomes of (11).
Knottnerus and Van Delden (2006) evaluated the bias of
$\hat{\rho}_{OLP}$ for the Dutch Supermarket data and found a small
underestimation of the correlation, resulting in a minor, less than
5%, overestimation of $\mathrm{var}(\hat{g}^{t,t-13})$.

The use of estimator $\hat{S}^{t-13,t}_{OLP}$ in (10) may give a negative
outcome of (11) and an estimated correlation $\hat{\rho}^{t-13,t}$ larger
than 1. For example, consider the specific population with
$N = 50$ and $H = 1$ consisting of the units of substratum
$h\ell = 65$. From the panel data for this population, given in
Table 2 for tf = 3 and ¢t = 16, we obtain after some calcu- 
Brions) S* *=84108, 8’ = 394.3 and G7 = 1,028, 
Note that in the remainder of this section the subscript
$h\ell = 11$ is omitted in the symbols because there is only one
stratum. Table 3 gives, for three different approaches, some
additional estimates for the panel data in Table 2. For
example, using $\hat{S}^{t-13,t}_{OLP}$ in (10) results in an estimated
correlation $\hat{\rho}^{t-13,t} = 1.39$. This then yields a negative
variance estimate from (11) of minus 2.2 million. Likewise,
for the same data the alternative estimator of $S^{t-13,t}$
based on Nordberg (2000) results in minus 36.9 million as
outcome of (11) because the corresponding estimated correlation
becomes 1.64. In contrast, using the correlation estimated
from the overlapping sample according to (12) yields
$\hat{\rho}_{OLP} = 0.9997$ and the positive variance estimate from
(11) becomes 52.1 million. In addition, for the panel data in
Table 2 the outcome of Nordberg's estimator (3.9) for the
covariance between $\hat{O}^{3}$ and $\hat{O}^{16}$ is 111.1 million whereas
covariance estimator (13) proposed here yields 67.8 million.
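
The inadmissible correlations reported above can be reproduced from the covariance estimates in Table 3. The sketch below is illustrative only: it assumes that the Eq. (12) covariance equals the overlap correlation times the product of the two estimated standard deviations, and backs that product out of the Eq. (12) row. The function name is hypothetical.

```python
def implied_correlation(cov_estimate, sd_product):
    """Correlation implied by a covariance estimate and a product of two standard deviations."""
    return cov_estimate / sd_product


# Values reported in Table 3. The product of the two standard deviations is
# backed out from the Eq. (12) row; this is an assumption for illustration,
# not a figure stated in the text.
rho_olp = 0.9997
cov_eq12 = 161.9e3
sd_product = cov_eq12 / rho_olp

cov_nordberg = 265.2e3  # covariance estimate based on Nordberg (2000)
cov_eq10 = 225.0e3      # covariance estimate from Eq. (10)

for label, cov in [("Nordberg (2000)", cov_nordberg),
                   ("Eq. (10)", cov_eq10),
                   ("Eq. (12)", cov_eq12)]:
    rho = implied_correlation(cov, sd_product)
    status = "admissible" if abs(rho) <= 1 else "exceeds 1"
    print(label, round(rho, 2), status)
```

Under this assumption the first two covariance estimates imply correlations of about 1.64 and 1.39, which is why they can produce a negative outcome of (11), while the Eq. (12) estimate stays within the admissible range by construction.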


Table 2 
Panel data¹ from a population with N = 50 and H = 1


period turnover per unit (in thousand euros) 

1 2 3 4 5
t=3 493.9 264.3 ite gAS 5) 380.0 
t=16 475.3 472.0 267.0 1,169.0 


¹Actually, the panel data belonged to substratum $h\ell = 65$.


5. Conclusions 


The variance formulas obtained in this paper are useful 
for calculating the variance of an estimated yearly growth 
rate of monthly turnover. The use of (13) as an estimator for
$\mathrm{cov}(\hat{O}^{t-12}, \hat{O}^{t})$ results in reasonable estimates of the
covariance of change in particular. The variance estimation
procedure allows for rotating panels, births, deaths, and 
units that migrate between strata. 

Furthermore, we recommend estimating a population 
covariance according to (12) based on the corresponding 
correlation estimated from the overlap and on the corre- 
sponding variances estimated from the larger separate 
samples. This may help to avoid a serious underestimation 
or a negative outcome of the variance estimator for the 
yearly growth rate. The resulting estimated covariances are 
only slightly biased. 


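Table 3
Estimates from three different approaches

approach                       parameters to be estimated
                               $S^{t-13,t}$                                   $\rho^{t-13,t}$                                            variance in (11)
Nordberg (2000)   estimator    $\hat{S}^{t-13,t}$ based on Nordberg (2000)    $\hat{S}^{t-13,t}/(\hat{S}^{t-13}\hat{S}^{t})$             Eq. (11)
                  result       265.2 × 10³                                    1.64                                                       -36.9 × 10⁶
Eq. (10)          estimator    $\hat{S}^{t-13,t}_{OLP}$                       $\hat{S}^{t-13,t}_{OLP}/(\hat{S}^{t-13}\hat{S}^{t})$       Eq. (11)
                  result       225.0 × 10³                                    1.39                                                       -2.2 × 10⁶
Eq. (12)          estimator    $\hat{\rho}_{OLP}\hat{S}^{t-13}\hat{S}^{t}$    $\hat{\rho}_{OLP}$                                         Eq. (11)
                  result       161.9 × 10³                                    1.00¹                                                      52.1 × 10⁶

¹In fact, 0.9997.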
For the sampling design of the Dutch Supermarkets the
second covariance term in (6) is negligible due to the fact 
that nj°3i0, is fixed. In contrast, for the SAMU design in 
Sweden this term is non-negligible and its estimation is 
time-consuming; the word SAMU (SAMordnade Urval) is a
Swedish acronym for coordinated samples. In Appendix A
we propose an alternative method for estimating this co- 
variance. However, under the condition that °° '* ~ opi \ 
it suffices in our opinion to only use the first covariance. 
This simplifies the estimation procedure considerably. 
Moreover, under the normality assumption the conditional 
confidence interval has better coverage properties compared 
to the unconditional interval. 


The example of the Dutch Supermarkets shows one of
the practical applications of the variance formulas:
determining which estimator has the smallest variance. The
results confirm that the variance of the simple estimator
$\hat{g}^{t,t-12}$ is close to that of $\hat{g}^{t,t-12}_{act}$ from section 4, which
corrects for the sample refreshment in January. Hence, for the
Dutch Supermarkets $\mathrm{var}(\hat{g}^{t,t-12})$ might be used for
estimating $\mathrm{var}(\hat{g}^{t,t-12}_{act})$. For branches with another SIC code it
needs to be checked whether $\mathrm{var}(\hat{g}^{t,t-12}_{act}) \approx \mathrm{var}(\hat{g}^{t,t-12})$ since
the impact of the refreshment in January need not be
negligible.
Appendix A
Justification of (5) 


Firstly, consider the case of strata without births and 
deaths. Apart from the yearly update in January, there are 
now no monthly updates. Hence, ni, = ni; is fixed 
from which (5) follows. This case applies to the Dutch 
Supermarkets because that population has been quite stable 
over the years. Secondly, in case of births and deaths among 


the strata we can write n7,, as 


H 
t Ct t 
Nig = Ny — Noy sap nT (16) 
k+h 
7 t-12,t 
where fo, 
births in months t—12, ...,¢-—1 among s/. Because the 
sampling procedure einone ng new births after month ¢ — 
lhe > 12 
12 is independent of the 7,,'°, the random variables nj," 
-12 
and n,,- have a zero covariance. Furthermore, using 
or, for short, 7, stands for the number of


cov(ny. n.,)=0 for k #h, it is seen from (16) that 
covGn, Mane) = 0107 =a ee 

In fact, it is assumed so far that the distribution of nj, 
(k # h) can be described by a hypergeometric distribution 
with parameters (N/, N'*", n/) irrespective of the values 
of the 1, '? A similar remark applies to 1j,'~’. However, it 
can be argued that in practice these assumptions lead to a 
minor, second-order error in the variance formulas. In order 
to trace this error, we assume for simplicity’s sake and 
without loss of generality that (i) births and deaths do not
migrate between strata, (ii) there are no deaths among the 
births, (iii) 1, = J, No), is fixed, (iv) after their first month 
in the population births are irrelevant for the monthly 
updates during the rest of the study period and (v) deaths are 
not selected in or removed from the sample by the monthly 
updates; so a third-order error is still ignored. Under these 
assumptions we now look more closely at the second 
covariance component for / = h, say Cy, ,... from (4a). In 
analogy with (6) C, can be written as 


hh, sec 
me 1 
hhissec. >) Ga=lDw st 
Ny, nN, 
H+\ H 
tH org ‘Sead 
cov El 2h Ong lv, [2 Dane OulV, 
k=0 
H+l H 
oe t ZO t-12 t 
o 5 T= n' yD ice. Orn,  COV(N;,, iy, )3 (17) 
h Ny, g=l k=l 


WHET Gs gi—1(Prn nn NIt, re Tes eel 12) a Oca a ULE 
the above assumptions C;,,,,.. = 0 for ¢ # h. 

To estimate the covariances in (17), consider the formula 
for the conditional expectation of y given x = x, when y 
and x follow a bivariate normal distribution. That is, in 
standard notation, 


Oo", 
E(yl x) = bw, +E Oy - HY): 
Oy 
In addition, for a given change Ax, of x the conditional 
expectation of the change of y is equal to E(Ay | Ax, ) = 
0), Arg / G. or, equivalently, 


ae E(Ay | M0) 


- 18 
Jy be Ne ( ) 


So for estimating, for instance, cov(7),j;.;.My,) in (17) 
under HULU) it suffices to evaluate the expected effect on 
y= n, , caused by a change of the future deaths x = 7, j;,, 
ILS 

Let Ait y., denote an additional (positive) change of 
these deaths in s,'*. Define pi, by pitt, = Ni ,/ 
Nia Where N fine is the number of deaths in stratum / 
between January and month ¢. Also, pj. = Nj." / Nj, 
(g =1,...,H +1). Using assumption (v), the expected 


number of additional deaths in the sample of January before 
an, tf


the refreshment can be estimated by pj"7', Ani, ;7,;.. Sub- 
sequently, the expected number of additional deaths in the 
sample after the refreshment can be estimated by 


ioe sae An,,, Pate 
eta Oram (19) 
jan 


where y,., 1s the reduction factor due to the refreshment in 
January. For the derivation of (19), see the end of this ap- 
pendix. The corresponding monthly updates between Janu- 
ary and month ¢ due to these additional deaths in the 
sample from stratum / lead to the following estimate of the 
expected increase of incoming units n,, from stratum k 
(k # h) inthe sample of month ¢ 


t t-12 jaw lane t—12 
E(Ani,| Bens = Vred Pi, 1 Qn 1 ja (20) 


where pj, = Ni,'>'/(Ni—Ni,). Recall from subsection 2. l 
that an update in cart s occurs only when d}'# f,Pi yah y 
where D,, (d;) stands for the number of deaths in U;, (s;,), 
and that n,,=,Ny,” is fixed when N}%;!, = 0 (k # A). 
Furthermore, note that births are excluded in the definition 
of p,,, in (20) because of assumption (iv). 

Next, define for m = 0,12 


NESE 
Or a I (On m 
h a Nia hi 9 
h i=] 
wien 
t—m rt Or m 7\t-m»?2, 
5) eee rede Te 
h a 
(pNP 2y ~ eth t-12 , 
POR Mac Geart  eeats 


ica lead Ee tiaael oes 
Pizh ad l Phn> 


H t—12 


= ol Pig t-12. 
On <1 ae eS jae Oe > 
g=l Ph<u 
H Pi 
lati = kh yt 
Oe =~ S t Orn: 

k=1 Pink 
k#h 


Now using (18) and (20), we obtain for k # h the fol- 
lowing covariance approximation 


acov(), 1 H+) Ni) 


t t-12 
E(Any,| Ani ret t-12 
Kyi Vary, 7741 
Ap 1141 
jan 


Yted gas Pin ny, Da d SB Apu = hn) 
Tie aya! ts we (21) 


v 


jan jan,t 


A,*= 7 Yted Pin h Pi, H+ pels pee yd TT ti)s 


where, for simplicity, we omitted the onlls N,, 


12 
(Nj '*—1) in the second line. Because nj,” is fixed, 
t-12 Fiabe) sett 

holds that cov(my jr. My,) = —COV(Mj +... + My a 


Hence, in analogy with the S ASHAS Ber geamneirts distribution 
we can use for $1 \le g \le H$ and $k \ne h$ the following


. . . . ei i} 
relationship for an approximation of cov(),,"°, 1,,) 


t-12 
f12 Pre t12 t 
ACOV(,» > Tuy) =o apap ACO (pears My, ) (22a) 
Ph<H 
t-12 t 
= ee Pia 
ye h t-12 A, (22b) 


Pp, <H Pe h 
where (21) is used as well. Alternatively, note that 


12 
COMM is.) 


iy nyo 


< - 1-12,t 
gsH iGU jg 


12 
cov(n), H+)? i) = 


t-12 
COV(S 0; i), 


where 


12 


hg 
gia 


-_ [Lif @ unit in U;,'*" is included in sample sj, 
0 otherwise. 


Hence, by symmetry, cov(§).;°, My,) = —COV(M), i741 My, )/ 
Nj, 23; from which (22a) follows (1 < g < H). Likewise, 
for k = h we obtain from (21) and (22b) 
A00vG, ping = —n Ay: 
ae is x 76, t (23) 
acov(,, Nin) = My Phe An! Ph cn 
respectively (1 < g < AH). Now substituting (21)-(23) into 


(17), we get the approximation 
C =) A OG = O,, ore )(O in, h =O, in. (24) 


hh, sec 


Assuming that the two terms between parentheses in (24) 
are absolutely smaller than S,, it follows from (24) that 
| ee Pana geet Pra iz Pyrw dd - ti) 


ee 


hh, sec 
Nn, 


7 


Hence, when p,,,, jam < 0.1, we may conclude that 
under the above assumptions the contribution of the second 
covariance component is less than 1% of var(@, ) so that 
(5) can be used without severely affecting the results. When 


Cinsec 1S non-negligible, it can be estimated from the sam- 
ple according to (24) by 
Cine a 
—t-|2 —t-l2 
A, {Gna —Oj ay \Gin h me cov(O;, < sie) )}/ Mh, (25) 


where in analogy with (10) and (12) c6w(G/217,0;,) is 
defined by 


i PR 1 

mae ah hh 12,t t-12 At 

c6v(G;, 27750),) = 71D Wey NEE Oc poen Shir 
Anco \ nn "nn hh 


We used in (25) that (i) for two arbitrary (unbiased) 
estimators @ and 5, E(ab) = = ab + cov(a, b) and (ii) 
cov(G,, , 0) =0(g #h ork #h). 
We conclude this appendix with the derivation of (19). 
The expected number of additional deaths remaining in the 
jan, ¢ 


sample of January during the refreshment is 0.9p;"7;,, 
Ani, ;;.;. The number of deaths outside the sample just 
before the refreshment can be estimated by WN es —
cA ae Deel Ant roe Hence, the number of new deaths in 
the sample due to the refreshments in all substrata U ies 
(1 < g < A) in January can be estimated by 


AT ane ane ALL jane t-12 
) H+ “has Pi AMh ya 


DAT yaar (ie) Rane IED: HE ctan is UR SSP 
; h Oh jan t-12, jan jan jany ° 
N, ‘ Non = (1, ined Mop ) 
é j Sah : 
Now using 7)" = f, No, °°“ according to the above as- 
sumptions, it is seen that after the refreshments the final 


t-12 


number of additional deaths in the sample due to An, ;/,, 
can be estimated by 


0.1(7;," ¥ Non) | jan, t 
) 


eS) 
7 Tyan qyfl2.jan (jan jan Pi AM, 41 
h Oh MN, Moh 


& ae) jan, t 1-12) jan) jane t-12 
= Pir AM 1 = Vred Pitt A 41 


Appendix B 
Some useful covariance formulas for overlapping 
samples 


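Let $s_{123}$ denote a mother sample consisting of three
mutually disjoint SRS subsamples $s_1$, $s_2$ and $s_3$. Let the
variable $x$ be observed in $s_{12}$ and the variable $y$ in $s_{23}$. The
corresponding sample means are denoted by $\bar{x}_{12}$ and $\bar{y}_{23}$,
respectively. Denote the size of $s_k$ by $n_k$ ($k = 1, 2, 3, 12,
23$). Define $\lambda = n_2/n_{12}$ and $\mu = n_2/n_{23}$.
Furthermore, define $S_{xy}$ by

$$S_{xy} = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{X})(y_j - \bar{Y}).$$

Then the covariance between $\bar{x}_{12}$ and $\bar{y}_{23}$ is equal to

$$\mathrm{cov}(\bar{x}_{12}, \bar{y}_{23}) = \left(\frac{\lambda\mu}{n_2} - \frac{1}{N}\right) S_{xy} = \left(\frac{n_2}{n_{12}\,n_{23}} - \frac{1}{N}\right) S_{xy}. \qquad (26)$$

This can be shown as follows

$$\begin{aligned}
\mathrm{cov}(\bar{x}_{12}, \bar{y}_{23})
&= \mathrm{cov}\{(1-\lambda)\bar{x}_1 + \lambda\bar{x}_2,\ \mu\bar{y}_2 + (1-\mu)\bar{y}_3\} \\
&= (1-\lambda)\,\mathrm{cov}(\bar{x}_1, \bar{y}_{23}) + \lambda\mu\,\mathrm{cov}(\bar{x}_2, \bar{y}_2) + \lambda(1-\mu)\,\mathrm{cov}(\bar{x}_2, \bar{y}_3) \\
&= -(1-\lambda)\frac{S_{xy}}{N} + \lambda\mu\left(\frac{1}{n_2} - \frac{1}{N}\right) S_{xy} - \lambda(1-\mu)\frac{S_{xy}}{N} \\
&= \left(\frac{\lambda\mu}{n_2} - \frac{1}{N}\right) S_{xy}.
\end{aligned}$$

In the third line we used that $\mathrm{cov}(\bar{x}_1, \bar{y}_{23}) = \mathrm{cov}(\bar{x}_2, \bar{y}_3) =
-S_{xy}/N$. This follows from the conditional covariance
formula

$$\begin{aligned}
\mathrm{cov}(\bar{x}_2, \bar{y}_3)
&= E\{\mathrm{cov}(\bar{x}_2, \bar{y}_3 \mid s_2)\} + \mathrm{cov}\{E(\bar{x}_2 \mid s_2),\ E(\bar{y}_3 \mid s_2)\} \\
&= 0 + \mathrm{cov}\left(\bar{x}_2,\ \frac{N\bar{Y} - n_2\bar{y}_2}{N - n_2}\right) \\
&= -\frac{n_2}{N - n_2}\,\mathrm{cov}(\bar{x}_2, \bar{y}_2) = -\frac{S_{xy}}{N}.
\end{aligned}$$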

For an alternative proof based on the sampling autocorrela- 
tion coefficient, see Knottnerus (2003, page 375). 
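
The covariance formula for overlapping samples can be checked numerically on a small population. The sketch below is an illustration under stated assumptions, not part of the original derivation: it assumes the closed form $(n_2/(n_{12} n_{23}) - 1/N) S_{xy}$, enumerates every equally likely ordered draw of the mother sample, and uses made-up data and hypothetical function names.

```python
from itertools import permutations
from statistics import mean


def exact_cov_overlap(x, y, n1, n2, n3):
    """Exact cov(mean of x over s1+s2, mean of y over s2+s3) by full enumeration.

    Each ordered draw without replacement is equally likely; its first n1
    units form s1, the next n2 form s2 and the last n3 form s3.
    """
    N = len(x)
    pairs = []
    for draw in permutations(range(N), n1 + n2 + n3):
        s12 = draw[: n1 + n2]
        s23 = draw[n1:]
        pairs.append((mean(x[i] for i in s12), mean(y[i] for i in s23)))
    mx = mean(p[0] for p in pairs)
    my = mean(p[1] for p in pairs)
    return mean((a - mx) * (b - my) for a, b in pairs)


def formula_cov_overlap(x, y, n1, n2, n3):
    """Closed form (n2 / (n12 * n23) - 1 / N) * S_xy assumed in this sketch."""
    N = len(x)
    xbar, ybar = mean(x), mean(y)
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (N - 1)
    return (n2 / ((n1 + n2) * (n2 + n3)) - 1 / N) * s_xy


# Hypothetical population of N = 6 units, for illustration only.
x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
y = [2.0, 3.0, 1.0, 9.0, 4.0, 6.0]
print(exact_cov_overlap(x, y, 1, 2, 1), formula_cov_overlap(x, y, 1, 2, 1))
```

The two printed numbers should agree up to floating-point rounding; full enumeration is only feasible for very small populations.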


References 


Berger, Y.G. (2004). Variance estimation for measures of change in 
probability sampling. The Canadian Journal of Statistics, 32, 451- 
467. 


Brodie, P. (2003). Review of recent work on variance estimation 
methods in business surveys. Unpublished report, Office for 
National Statistics, London. 


Cox, D.R., and Hinkley, D.V. (1974). Theoretical Statistics. London: 
Chapman and Hall. 


Hidiroglou, M.A., Särndal, C.-E. and Binder, D.A. (1995). Weighting
and estimation in business surveys. In Business Survey Methods, 
(Eds., B.G. Cox et al.). New York: John Wiley & Sons, Inc. 


Holt, D., and Skinner, C.J. (1989). Components of change in repeated 
surveys. International Statistical Review, 57, 1-18.


Holt, D., and Smith, T.M.F. (1979). Poststratification. Journal of the 
Royal Statistical Society, A, 142, 33-46. 


Kish, L. (1965). Survey sampling. New York: John Wiley & Sons, Inc. 


Knottnerus, P. (2003). Sample Survey Theory: Some Pythagorean 
Perspectives. New York: Springer-Verlag.


Knottnerus, P., and Van Delden, A. (2006). Estimation of changes in 
repeated surveys and their significance,
http://www.iser.essex.ac.uk/ulsc/mols2006/programme/data/paper/Knottnerus.doc.


Konschnik, C.A., Monsour, N.J. and Detlefsen, R.E. (1985). 
Constructing and maintaining frames and samples for business 
surveys. Proceedings of the Survey Research Methods Section, 
American Statistical Association, 113-122. 


Laniel, N. (1987). Variances for a rotating sample from a changing 
population. Proceedings of the Survey Research Methods Section, 
American Statistical Association, 496-500. 


Lowerre, J.M. (1979). Sampling for change. Proceedings of the Survey 
Research Methods Section, American Statistical Association, 343- 
347. 


Nordberg, L. (2000). On variance estimation for measures of change 
when samples are coordinated by the use of permanent random 
numbers. Journal of Official Statistics, 16, 363-378. 


Qualité, L., and Tillé, Y. (2008). Variance estimation of changes in 
repeated surveys and its application to the Swiss survey of value 
added. Survey Methodology, 34, 173-181. 


Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted
Survey Sampling. New York: Springer-Verlag.


Smith, P., Pont, M. and Jones, T. (2003). Developments in business 
survey methodology in the Office for National Statistics, 1994- 
2000. The Statistician, 52, 257-295. 


Tam, S.M. (1984). On covariances from overlapping samples. The 
American Statistician, 38, 288-289. 


Wood, J. (2008). On the covariance between related Horvitz- 
Thompson estimators. Journal of Official Statistics, 24, 53-78. 


Survey Methodology, June 2012 
Vol. 38, No. 1, pp. 53-62 
Statistics Canada, Catalogue No. 12-001-X 




Variance inflation factors in the analysis of complex survey data 


Dan Liao and Richard Valliant¹


Abstract 


Survey data are often used to fit linear regression models. The values of covariates used in modeling are not controlled as 
they might be in an experiment. Thus, collinearity among the covariates is an inevitable problem in the analysis of survey 
data. Although many books and articles have described the collinearity problem and proposed strategies to understand, 
assess and handle its presence, the survey literature has not provided appropriate diagnostic tools to evaluate its impact on 
regression estimation when the survey complexities are considered. We have developed variance inflation factors (VIFs) 
that measure the amount that variances of parameter estimators are increased due to having non-orthogonal predictors. The 
VIFs are appropriate for survey-weighted regression estimators and account for complex design features, e.g., weights, 
clusters, and strata. Illustrations of these methods are given using a probability sample from a household survey of health
and nutrition.


Key Words: Cluster sample; Collinearity diagnostics; Linearization variance estimator; Survey-weighted least squares;
Stratified sample.


1. Introduction 


Collinearity of predictor variables in a linear regression 
refers to a situation where explanatory variables are 
correlated with each other. The terms, multicollinearity and 
ill conditioning, are also used to denote the same situation. 
Collinearity is worrisome for both numerical and statistical 
reasons. The estimates of slope coefficients can be numerically
unstable in some data sets in the sense that small
changes in the X's or the Y's can produce large changes
in the values of these estimates. Statistically, correlation
among the predictors can lead to slope estimates with large
variances. In addition, when X's are strongly correlated,
the $R^2$ in a regression can be large while the individual
slope estimates are not statistically significant. Even if slope
estimates are significant, they may have signs that are the 
opposite of what are expected (Neter, Kutner, Wasserman 
and Nachtsheim 1996). Collinearity may also affect fore- 
casts (Smith 1974; Belsley 1984). 

In experimental designs, it may be possible to create 
situations where the explanatory variables are orthogonal to 
each other. But, in many surveys, variables that are substan- 
tially correlated are collected for analysis. For example, total 
income and its components (e.g., wages and salaries, capital 
gains, interest and dividends) are collected in the Panel 
Survey of Income Dynamics (http://psidonline.isr.umich. 
edu/) to track economic well-being over time. When one 
explanatory variable is a linear combination of the others, 
this is known as perfect collinearity (or multicollinearity) 
and is easy to identify. Cases that are of interest in practice 
are ones where collinearity is less than perfect but still 
affects the precision of estimates (Kmenta 1986, section 
10.3). 


Although there is a substantial literature on regression 
diagnostics for non-survey data, there is considerably less 
for survey data. A few articles in the last decade introduced 
techniques for the evaluation of the quality of regression on 
complex survey data, mainly on identifying influential 
points and influential groups with abnormal data values or 
survey weights. Elliot (2007), for instance, developed 
Bayesian methods for weight trimming of linear and gener- 
alized linear regression estimators in unequal probability-of- 
inclusion designs. Li (2007a,b); Li and Valliant (2009, 
2011) adapted and extended a series of traditional diagnostic 
techniques to regression on complex survey data, mainly on 
identifying influential observations and influential groups of 
observations. Li’s research covers residuals and leverages, 
DFBETA, DFBETAS, DFFIT, DFFITs, Cook’s Distance 
and the forward search approach. Although an extensive 
literature in applied statistics provides valuable suggestions 
and guidelines for data analysts to diagnose the presence 
of collinearity (e.g., Farrar and Glauber 1967; Theil 
1971; Belsley, Kuh and Welsch 1980; Fox 1984; Belsley 
1991), none of this research touches upon diagnostics for 
collinearity when fitting models with survey data. 

The variance inflation factor (VIF), described in section 2,
is one of the most popular conventional collinearity diag- 
nostic techniques, and is mainly aimed at ordinary or 
weighted least squares regressions. A VIF measures the 
inflation of the variance of a slope estimate caused by the 
nonorthogonality of the predictors over and above what the 
variance would be with orthogonality. In section 3, we 
consider the case of an analyst who estimates model 
parameters using survey-weighted least squares (SWLS) 
and derive VIFs appropriate to SWLS. The components of 
the VIF can be estimated using the ingredients of a variance 


1. Dan Liao, RTI International, 701 13th Street, N.W., Suite 750, Washington DC, 20005. E-mail: dliao@rti.org; Richard Valliant, University of Michigan
and University of Maryland, Joint Program in Survey Methodology, 1218 Lefrak Hall, College Park, MD, 20742. 
estimator that is in common usage in software packages for
analyzing survey data. In the case of linear regression, a 
type of sandwich variance estimator will estimate both the 
model variance and design variance of the SWLS slope 
estimator. As we will show in section 3, the model or design
variance of $\hat{\beta}_k$, an estimator of slope associated with the
predictor $x_k$, is inflated somewhat when different predictors
are correlated with each other compared to what the
variance would be if $x_k$ were orthogonal to the other
predictors. The measure of inflation, the VIF, is composed of
terms that must be estimated from the sample. Our approach 
has been to substitute estimators that have both a model and 
design interpretation as described in section 3.5. 

The fourth section presents an empirical study using data 
from the United States National Health and Nutrition Exam- 
ination Survey. The application of our new approach is 
demonstrated and the newly-derived VIF values for SWLS 
are compared to the ones for OLS or WLS, which can be 
obtained from the standard statistical packages. The 
comparisons show that VIF values are different for different 
regression methods and a VIF specific to complex samples
should be used to evaluate the harmfulness of collinearity in
the analysis of survey data. 


2. Collinearity diagnostics in ordinary least 
squares estimation 


Suppose the sample $s$ has $n$ units, on each of which $p$
$x$'s or predictors and one analysis variable $Y$ are observed.
The standard linear model in a nonsurvey setting is
$Y = X\beta + e$, where $Y$ is an $n \times 1$ vector of observations
on a response or dependent variable; $X = (x_1, \ldots, x_p)$ is an
$n \times p$ design matrix of fixed constants with $x_k$ the $n \times 1$
vector of values of explanatory variable $k$ for the $n$ sample
units; $\beta$ is a $p \times 1$ vector of parameters to be estimated;
and $e$ is an $n \times 1$ vector of statistically independent error
terms with zero mean and constant variance $\sigma^2$. We
assume, for simplicity, that $X$ has full column rank. The
ordinary least squares (OLS) estimate of $\beta$ is
$\hat{\beta} = (X^T X)^{-1} X^T Y$, for which the model variance is
$\mathrm{Var}_M(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$. Here, we use the subscript $M$ to
denote expectation under the model.

Collinearities of explanatory variables inflate the model
variance of the regression coefficients compared to having
orthogonal X's. This effect can be seen in the formula for
the variance of a specific estimated non-intercept coefficient
$\hat{\beta}_k$ (Theil 1971),

$$\mathrm{Var}_M(\hat{\beta}_k) = \frac{\sigma^2}{\sum_{i \in s} x_{ik}^2}\,\frac{1}{1 - R_k^2}, \qquad (1)$$

where $R_k^2$ is the square of the multiple correlation from the
regression of the $k$th column of $X$ on the other columns.
This R-square is defined as
$R_k^2 = \hat{\beta}_{(k)}^T X_{(k)}^T X_{(k)} \hat{\beta}_{(k)} / x_k^T x_k$,
where $\hat{\beta}_{(k)}$ is the OLS estimate of the slope when $x_k$ is
regressed on the other $x$'s and $X_{(k)}$ is the $X$ matrix with
the $k$th column removed. The term $\sigma^2 / \sum_{i \in s} x_{ik}^2$ is the model
variance of $\hat{\beta}_k$ if the $k$th predictor were orthogonal to all
the other predictors. The value of $R_k^2$ may be nonzero
because the $k$th predictor is correlated with one other
explanatory variable or because of a more complex pattern of
dependence between $x_k$ and several other predictors.
Consequently, the collinearity between $x_k$ and some other
explanatory variables can result in the inflation of the variance of
$\hat{\beta}_k$ beyond what would be obtained with orthogonal X's.
The second term in (1), $(1 - R_k^2)^{-1}$, is called the variance-
inflation factor (VIF) (Theil 1971).
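
The decomposition in (1) can be verified numerically. The sketch below is a minimal illustration with made-up data and hypothetical helper names: it computes the uncentered $R_k^2$ from a regression of one column on the others and checks that the $k$th diagonal element of $(X^T X)^{-1}$ equals the VIF divided by $x_k^T x_k$. A small Gaussian elimination routine is used so that the example needs no external library.

```python
def gram(cols_a, cols_b):
    """Matrix of inner products between two lists of columns."""
    return [[sum(u * v for u, v in zip(a, b)) for b in cols_b] for a in cols_a]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def vif(cols, k):
    """Variance inflation factor 1 / (1 - R_k^2) with the uncentered R_k^2."""
    xk = cols[k]
    others = [c for j, c in enumerate(cols) if j != k]
    xtx = gram(others, others)
    xty = [sum(u * v for u, v in zip(c, xk)) for c in others]
    beta = solve(xtx, xty)
    # beta' X'X beta equals beta' X' x_k because of the normal equations.
    fitted_ss = sum(b * v for b, v in zip(beta, xty))
    r2 = fitted_ss / sum(v * v for v in xk)
    return 1.0 / (1.0 - r2)


def inverse_diagonal(cols, k):
    """k-th diagonal element of (X'X)^{-1}."""
    unit = [1.0 if j == k else 0.0 for j in range(len(cols))]
    return solve(gram(cols, cols), unit)[k]


# Hypothetical design: an intercept and two correlated predictors.
cols = [[1.0] * 6, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2.0, 1.0, 4.0, 3.0, 6.0, 5.5]]
for k in (1, 2):
    ss = sum(v * v for v in cols[k])
    print(k, vif(cols, k), inverse_diagonal(cols, k), vif(cols, k) / ss)
```

When the columns are mutually orthogonal, the fitted sum of squares is zero and the factor reduces to 1, so no inflation occurs.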

A basic reference on collinearity and other OLS
diagnostics is Belsley et al. (1980). Collinearity diagnostics are
covered in many other textbooks including Fox (1984) and
Neter et al. (1996). In some cases, it is desirable to weight
cases differentially in a regression analysis to incorporate a
nonconstant residual variance. This form of weighting is
model-based and is called weighted least squares (WLS).
Most current statistical software packages (e.g., SAS,
Stata, S-Plus and R) use $(1 - R^2_{k(WLS)})^{-1}$ as VIF for WLS,
where $R^2_{k(WLS)}$ is the square of the multiple correlation from
the WLS regression of the $k$th column of $X$ on the other
columns. Fox and Monette (1992) also generalized this
concept of variance inflation as a measure of collinearity to
a subset of parameters in $\beta$ and derived a generalized
variance-inflation factor (GVIF). Furthermore, some
interesting work has developed VIF-like measures, such as
collinearity indices in Steward (1987) that are simply the
square roots of the VIFs and tolerance defined as the
inverse of VIF in Simon and Lesage (1988).


3. VIF in survey weighted least squares regression 


3.1 Survey-weighted least squares estimators 


Suppose the underlying structural model in the
superpopulation is $Y = X^T\beta + e$, where the error terms in
the model have a general variance structure $e \sim (0, \sigma^2 V)$
with known $V$ and $\sigma^2$. Define $W$ to be the diagonal
matrix of survey weights. We assume throughout that the
survey weights are constructed in such a way that they can
be used for estimating finite population totals. The survey
weighted least squares (SWLS) estimator is
$\hat{\beta}_{SW} = (X^T W X)^{-1} X^T W Y$, assuming $X^T W X$ is invertible. Fuller
(2002) describes the properties of this estimator.
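
For a model with an intercept and a single predictor, the SWLS estimator has a simple closed form in terms of weighted means and weighted cross-products. The sketch below is illustrative only; the function name and the data are hypothetical, and the general matrix form is not implemented.

```python
def swls_line(x, y, w):
    """Survey-weighted least squares fit of y = b0 + b1 * x with weights w.

    This is the special case of (X'WX)^{-1} X'WY in which X holds a column
    of ones and one predictor.
    """
    total_w = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    s_xy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


# Hypothetical sample of four units with unequal survey weights.
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]  # lies exactly on y = 2 + 3x
w = [10.0, 1.0, 3.0, 7.0]
print(swls_line(x, y, w))
```

When the data lie exactly on a line, any set of positive weights recovers that line; with noisy data the weights change the fit, which is the situation the survey-weighted VIFs are designed for.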

The estimate $\hat{\beta}_{SW}$ is a model unbiased estimator of $\beta$
under the model $Y = X^T\beta + e$ regardless of whether
$\mathrm{Var}_M(e) = \sigma^2 V$ is specified correctly or not, and is
approximately design-unbiased for the census parameter
$B_U = (X_U^T X_U)^{-1} X_U^T Y_U$ in the finite population $U$ of $N$
units. The subscript $U$ stands for the finite population,
$Y_U = (Y_1, \ldots, Y_N)^T$, and $X_U = (x_1, \ldots, x_p)$ with $x_k$ as the
$N \times 1$ vector of values for covariate $k$.


3.2 Model variance of coefficient estimates 


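The model variance of the parameter estimator $\hat{\beta}_{SW}$,
assuming $\mathrm{Var}_M(e) = \sigma^2 V$, can be expressed as

$$\mathrm{Var}_M(\hat{\beta}_{SW}) = \sigma^2 (X^T W X)^{-1} X^T W V W X (X^T W X)^{-1} = \sigma^2 A^{-1} B A^{-1} = \sigma^2 G, \qquad (2)$$

where $\tilde{X} = W^{1/2} X$, $\tilde{V} = W^{1/2} V W^{1/2}$, $A = \tilde{X}^T \tilde{X}$,
$B = \tilde{X}^T \tilde{V} \tilde{X}$ and $G = A^{-1} B A^{-1}$.

If the columns of $\tilde{X}$ are orthogonal, then $\tilde{X}^T \tilde{X} = \mathrm{diag}(\tilde{x}_k^T \tilde{x}_k)$
and $A^{-1} = \mathrm{diag}(1 / \tilde{x}_k^T \tilde{x}_k)$, where $\tilde{x}_k = W^{1/2} x_k$.
The $k$th diagonal element of $G$ then becomes
$\tilde{x}_k^T \tilde{V} \tilde{x}_k / (\tilde{x}_k^T \tilde{x}_k)^2$.
Thus, when the columns of $\tilde{X}$ are orthogonal, the model variance of
$\hat{\beta}_{SW_k}$ is

$$\mathrm{Var}_M(\hat{\beta}_{SW_k}) = \sigma^2 \frac{\tilde{x}_k^T \tilde{V} \tilde{x}_k}{(\tilde{x}_k^T \tilde{x}_k)^2}, \qquad (3)$$

a fact we will use later. More generally, the model variance
of $\hat{\beta}_{SW_k}$, the coefficient estimate for the $k$th explanatory
variable, is

$$\mathrm{Var}_M(\hat{\beta}_{SW_k}) = i_k^T \mathrm{Var}_M(\hat{\beta}_{SW}) i_k = \sigma^2 i_k^T G i_k = \sigma^2 g^{kk}, \qquad (4)$$

where $i_k$ is a $p \times 1$ vector with 1 in position $k$ and 0's
elsewhere, and $g^{kk}$ is the $k$th diagonal element of
matrix $G$.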
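The estimator of Section 3.1 and the sandwich form in (2) can be sketched numerically. This is a minimal illustration, assuming NumPy is available; the function names `swls_fit` and `model_variance`, the simulated data, and the choice $\sigma^2 = 1$ are hypothetical and not part of the paper.

```python
import numpy as np

def swls_fit(X, y, w):
    """Return beta_hat = (X'WX)^{-1} X'Wy and A = X'WX for a weight vector w."""
    Xw = X * w[:, None]                 # W X (each row of X scaled by its weight)
    A = X.T @ Xw                        # X'WX
    beta = np.linalg.solve(A, Xw.T @ y) # solves A beta = X'Wy
    return beta, A

def model_variance(X, w, V, sigma2=1.0):
    """Sandwich form sigma^2 * A^{-1} B A^{-1} with B = X'W V W X, as in (2)."""
    Xw = X * w[:, None]
    A_inv = np.linalg.inv(X.T @ Xw)
    B = Xw.T @ V @ Xw
    return sigma2 * A_inv @ B @ A_inv

# Simulated data (illustrative assumption, not survey data)
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
w = rng.uniform(1.0, 10.0, size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, A = swls_fit(X, y, w)
G = model_variance(X, w, V=np.eye(n))   # V = I: G differs from A^{-1} unless weights are constant
```

When $V = W^{-1}$ the middle matrix $B$ equals $A$, so `model_variance` collapses to $A^{-1}$, the usual WLS form.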


3.3. Model-based VIF 


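As shown in Appendix A, the model variance of $\hat{\beta}_{SW_k}$ in
(4) can be written as:

$$\mathrm{Var}_M(\hat{\beta}_{SW_k}) = g^{kk} \sigma^2 = \frac{\zeta_k \varrho_k}{1 - R^2_{SW(k)}} \cdot \frac{\sigma^2 \, \tilde{x}_k^T \tilde{V} \tilde{x}_k}{(\tilde{x}_k^T \tilde{x}_k)^2} \qquad (5)$$

where

$$\zeta_k = \frac{\tilde{e}_{xk}^T \tilde{V} \tilde{e}_{xk}}{\tilde{e}_{xk}^T \tilde{e}_{xk}} = \frac{e_{xk}^T W V W e_{xk}}{e_{xk}^T W e_{xk}},$$

with $e_{xk} = x_k - X_{(k)} \hat{\beta}_{SW(k)}$ being the residual from the SWLS
regression of $x_k$ on $X_{(k)}$ (the columns of $X$ other than the $k$th), and
$\tilde{e}_{xk} = \tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)} = W^{1/2} e_{xk}$,

$$\varrho_k = \frac{\tilde{x}_k^T \tilde{x}_k}{\tilde{x}_k^T \tilde{V} \tilde{x}_k} = \frac{x_k^T W x_k}{x_k^T W V W x_k},$$

and $R^2_{SW(k)}$, defined in Appendix A, is the square of the
multiple correlation from the weighted regression of the $k$th
column of $X$ on the other columns. Hence, $\zeta_k$ and $\varrho_k$
depend on $W$ and $V$. The variance under orthogonality in
(3) is inflated

$$VIF_k = \frac{\zeta_k \varrho_k}{1 - R^2_{SW(k)}} \qquad (6)$$

times when incorporating the other $p - 1$ explanatory
variables in SWLS. The model-based VIF in SWLS includes
not only the multiple correlation coefficient $R^2_{SW(k)}$
but also two adjustment coefficients, $\zeta_k$ and $\varrho_k$, that are
not present in the OLS and WLS cases.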
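A short numerical sketch may help fix the pieces of (5) and (6). It assumes NumPy, a known diagonal $V$ chosen arbitrarily, and simulated covariates; `swls_vif` is a hypothetical helper name, and no intercept adjustment is made in the auxiliary regression.

```python
import numpy as np

def swls_vif(X, w, V, k):
    """Model-based VIF_k = zeta_k * rho_k / (1 - R^2_SW(k)), following (5)-(6)."""
    x_k = X[:, k]
    X_rest = np.delete(X, k, axis=1)              # X_(k): all columns except k
    Xw = X_rest * w[:, None]                      # W X_(k)
    b = np.linalg.solve(X_rest.T @ Xw, Xw.T @ x_k)
    e = x_k - X_rest @ b                          # e_xk, SWLS residual of x_k on X_(k)
    WVW = w[:, None] * V * w[None, :]             # W V W for diagonal W
    zeta = (e @ WVW @ e) / (e @ (w * e))
    rho = (x_k @ (w * x_k)) / (x_k @ WVW @ x_k)
    r2 = 1.0 - (e @ (w * e)) / (x_k @ (w * x_k))  # uncentered weighted R^2
    return zeta * rho / (1.0 - r2), zeta, rho, r2

# Simulated, deliberately collinear design (illustrative assumption)
rng = np.random.default_rng(1)
n = 40
z = rng.normal(size=n)
X = np.column_stack([np.ones(n), z, z + 0.3 * rng.normal(size=n)])
w = rng.uniform(1.0, 20.0, size=n)
V = np.diag(rng.uniform(0.5, 2.0, size=n))

vif, zeta, rho, r2 = swls_vif(X, w, V, k=2)
```

Multiplying `vif` by the orthogonal-case variance in (3) should reproduce the $k$th diagonal element of $G$ from (2), which is the identity derived in Appendix A.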

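Using the singular value decomposition of $\tilde{V}$, we can
bound the factor $\zeta_k \varrho_k$, which is the adjustment to the VIF in
WLS. Based on the extrema of the ratio of quadratic forms
(Lin 1984), the term $\zeta_k$ is bounded in the range
$\mu_{\min}(\tilde{V}) \le \zeta_k \le \mu_{\max}(\tilde{V})$, and $\varrho_k$ is bounded in the range

$$\frac{1}{\mu_{\max}(\tilde{V})} \le \varrho_k \le \frac{1}{\mu_{\min}(\tilde{V})},$$

where $\mu_{\min}(\tilde{V})$ and $\mu_{\max}(\tilde{V})$ are the minimum and maximum
singular values of the matrix $\tilde{V}$. Combining these
results, the joint coefficient $\zeta_k \varrho_k$ is bounded in the range:

$$\frac{\mu_{\min}(\tilde{V})}{\mu_{\max}(\tilde{V})} \le \zeta_k \varrho_k \le \frac{\mu_{\max}(\tilde{V})}{\mu_{\min}(\tilde{V})}.$$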

Notice that when $\tilde{V} = I$, $\zeta_k = \varrho_k = 1$ and (5) reduces to

$$\mathrm{Var}_M(\hat{\beta}_{SW_k}) = \frac{1}{1 - R^2_{SW(k)}} \cdot \frac{\sigma^2}{x_k^T W x_k},$$

which is the model variance of the WLS estimates when $V$
is diagonal and $W$ is correctly specified as $W = V^{-1}$. In
that unusual case, the VIF currently computed by software
packages will be appropriate for SWLS. However, rarely
will it be reasonable to think that $W = V^{-1}$ in survey
estimation. If $\tilde{V} \ne I$, then $\zeta_k$ and $\varrho_k$ are not equal to 1
and a specialized calculation of the VIF is still needed.
When $V = I$, which is the usual application considered by
analysts,

$$\zeta_k = \frac{e_{xk}^T W^2 e_{xk}}{e_{xk}^T W e_{xk}}, \qquad \varrho_k = \frac{x_k^T W x_k}{x_k^T W^2 x_k}$$

and

$$\frac{\mu_{\min}(\tilde{V})}{\mu_{\max}(\tilde{V})} = \frac{w_{\min}}{w_{\max}},$$

where $w_{\min}$ is the minimum value of the survey weights and
$w_{\max}$ is their maximum value. In this case,
$\zeta_k \varrho_k$ is bounded by

$$\frac{w_{\min}}{w_{\max}} \le \zeta_k \varrho_k \le \frac{w_{\max}}{w_{\min}}.$$

When all the survey weights are constant, $\zeta_k \varrho_k = 1$ and the
VIF produced by standard software, $(1 - R^2_{SW(k)})^{-1}$, does not
need to be adjusted in SWLS; however, when the range of
the survey weights is large, $\zeta_k \varrho_k$ can be very small or large
and can be either above or below 1. In this case the VIF
produced by standard software is not appropriate and a
special calculation is needed. These facts will be illustrated
in our experimental studies.
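The $V = I$ case is simple enough to check with plain arithmetic. The sketch below is a toy illustration: the vectors `x`, `z` and both weight vectors are made up, and the auxiliary regression uses a single other column with no intercept purely to keep the code short.

```python
def adjustment_factor(x, z, w):
    """zeta_k * rho_k when V = I, with x_k regressed on one other column z (no intercept)."""
    # Weighted least squares slope of x on z
    b = sum(wi * zi * xi for wi, zi, xi in zip(w, z, x)) / sum(wi * zi * zi for wi, zi in zip(w, z))
    e = [xi - b * zi for xi, zi in zip(x, z)]   # residual e_xk
    # zeta_k = e'W^2 e / e'W e
    zeta = sum(wi ** 2 * ei ** 2 for wi, ei in zip(w, e)) / sum(wi * ei ** 2 for wi, ei in zip(w, e))
    # rho_k = x'W x / x'W^2 x
    rho = sum(wi * xi ** 2 for wi, xi in zip(w, x)) / sum(wi ** 2 * xi ** 2 for wi, xi in zip(w, x))
    return zeta * rho

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
w_varied = [1.0, 1.0, 1.0, 10.0, 10.0, 55.0]   # hypothetical, highly variable weights
w_const = [3.0] * 6                            # constant weights

factor_varied = adjustment_factor(x, z, w_varied)
factor_const = adjustment_factor(x, z, w_const)
lower = min(w_varied) / max(w_varied)
upper = max(w_varied) / min(w_varied)
print(lower, factor_varied, upper, factor_const)
```

With constant weights the factor is 1 up to floating-point error, while with variable weights it falls somewhere in the interval from `lower` to `upper`.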

The VIF in (6) is appropriate regardless of whether the 
model contains an intercept or not. An alternative version 
can also be written that assumes that an intercept is in the 
model when x, is regressed on the other x’s. The 
derivation of this form is in Liao (2010). We summarize the 
result below. 

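The variance of $\hat{\beta}_{SW_k}$ in a model M2 that includes an
intercept and in which $x_k$ is orthogonal to the other $x$'s is:

$$\mathrm{Var}_{M2}(\hat{\beta}_{SW_k}) = \sigma^2 \frac{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T \tilde{V} (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}{SST^2_{SWm(k)}}, \qquad (7)$$

where $\tilde{1} = W^{1/2} 1 = (w_1^{1/2}, \ldots, w_n^{1/2})^T$,
$\bar{\tilde{x}}_k = \sum_{i \in s} w_i x_{ki} / \hat{N}$, $\hat{N} = \sum_{i \in s} w_i$,
and $SST_{SWm(k)} = \tilde{x}_k^T \tilde{x}_k - \hat{N} \bar{\tilde{x}}_k^2$. The variance of
$\hat{\beta}_{SW_k}$ can then be rewritten as

$$\mathrm{Var}_M(\hat{\beta}_{SW_k}) = \frac{\zeta_k \varrho_{mk}}{1 - R^2_{SWm(k)}} \mathrm{Var}_{M2}(\hat{\beta}_{SW_k}), \qquad (8)$$

where $R^2_{SWm(k)}$ is the SWLS R-square from regressing $x_k$
on the $x$'s in the remainder of $X$ (excluding a column for
the intercept). The term $\zeta_k$ was defined following (5) and

$$\varrho_{mk} = \frac{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T \tilde{V} (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}.$$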


Most software packages will consistently provide
$(1 - R^2_{SWm(k)})^{-1}$ as the VIF as part of WLS regression output.
Note that this is different from the VIF, $(1 - R^2_{SW(k)})^{-1}$,
introduced in section 3.3 which does not assume that an
intercept is retained in the model. Software packages
generally do not supply $(1 - R^2_{SW(k)})^{-1}$.

Using arguments similar to those in the previous section,
we can bound $\zeta_k \varrho_{mk}$ by

$$\frac{\mu_{\min}(\tilde{V})}{\mu_{\max}(\tilde{V})} \le \zeta_k \varrho_{mk} \le \frac{\mu_{\max}(\tilde{V})}{\mu_{\min}(\tilde{V})}.$$


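The model variance of $\hat{\beta}_{SW_k}$ is inflated by

$$VIF_{mk} = \frac{\zeta_k \varrho_{mk}}{1 - R^2_{SWm(k)}} \qquad (9)$$

compared to its variance in the model (M2) with only the
explanatory variable $x_k$ and intercept. The new
intercept-adjusted $VIF_{mk}$ retains some properties of $VIF_k$ in (6).
When $\tilde{V} = I$, we have $\zeta_k = 1$, $\varrho_{mk} = 1$ and the
intercept-adjusted VIF in (8) for SWLS is equal to the conventional
intercept-adjusted VIF: $(1 - R^2_{SWm(k)})^{-1}$. When $V = I$, we
have $\tilde{V} = W$,

$$\zeta_k = \frac{e_{xk}^T W^2 e_{xk}}{e_{xk}^T W e_{xk}}, \qquad \varrho_{mk} = \frac{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T W (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}$$

and

$$\frac{\mu_{\min}(\tilde{V})}{\mu_{\max}(\tilde{V})} = \frac{w_{\min}}{w_{\max}}.$$

The range of $\zeta_k \varrho_{mk}$ also depends on the range of survey
weights as did $\zeta_k \varrho_k$.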
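The intercept-adjusted factor can be sketched the same way as the unadjusted one. The code below assumes NumPy, takes a covariate matrix `X` that does not itself contain an intercept column, and uses simulated data; `swls_vif_intercept` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def swls_vif_intercept(X, w, V, k):
    """Intercept-adjusted VIF_mk = zeta_k * rho_mk / (1 - R^2_SWm(k)).

    X holds covariates only; an intercept column is added for the
    auxiliary SWLS regression of x_k on the remaining covariates.
    """
    n = len(w)
    x_k = X[:, k]
    Z = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
    Zw = Z * w[:, None]
    b = np.linalg.solve(Z.T @ Zw, Zw.T @ x_k)
    e = x_k - Z @ b                        # residual e_xk
    WVW = w[:, None] * V * w[None, :]      # W V W for diagonal W
    xbar = (w @ x_k) / w.sum()             # weighted mean; N_hat = sum of weights
    d = x_k - xbar                         # centered covariate
    sst = d @ (w * d)                      # SST_SWm(k)
    zeta = (e @ WVW @ e) / (e @ (w * e))
    rho_m = sst / (d @ WVW @ d)
    r2_m = 1.0 - (e @ (w * e)) / sst
    return zeta * rho_m / (1.0 - r2_m)

# Simulated covariates with one near-dependency (illustrative assumption)
rng = np.random.default_rng(2)
n = 60
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.5 * rng.normal(size=n), rng.normal(size=n)])
w = rng.uniform(1.0, 30.0, size=n)
V = np.eye(n)

vif_m = [swls_vif_intercept(X, w, V, k) for k in range(X.shape[1])]
```

With `V = np.diag(1 / w)` both adjustment coefficients equal 1 and the function returns the conventional $(1 - R^2_{SWm(k)})^{-1}$.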


3.4 Estimating the VIF for a model with stratified 
clustering when V is unknown 


In the previous sections, we used model-based arguments 
to derive VIFs. The VIFs contain terms, $V$ in particular,
that are unknown and must be estimated. In this section, we 
construct estimators of the components of the VIFs, again 
using model-based arguments. However, a standard, design- 
based linearization variance estimator also estimates the 
model variance, as shown below, and supplies the compo- 
nents needed to estimate the VIF. In the remainder of this 
section, we will present estimators that are appropriate for a 
model that has a stratified clustered covariance structure. 

Suppose that in a stratified multistage sampling design,
there are $h = 1, \ldots, H$ strata in the population, $i = 1, \ldots, N_h$
clusters in the corresponding stratum $h$ and $t = 1, \ldots, M_{hi}$
units in cluster $hi$. We select $i = 1, \ldots, n_h$ clusters in
stratum $h$ and $t = 1, \ldots, m_{hi}$ units in cluster $hi$. Denote the
set of sample clusters in stratum $h$ by $s_h$ and the sample of
units in cluster $hi$ as $s_{hi}$. The total number of sample units
in stratum $h$ is $m_h = \sum_{i \in s_h} m_{hi}$ and the total in the sample is
$m = \sum_{h=1}^{H} m_h$. Clusters are assumed to be selected with
replacement within strata and independently between strata.
Consider this model:

$$E_M(Y_{hit}) = x_{hit}^T \beta,$$
$$\mathrm{Cov}_M(Y_{hit}, Y_{h'i't'}) = 0, \quad h \ne h', \text{ or } h = h' \text{ and } i \ne i'.$$

Units within each cluster are assumed to be correlated but
the particular form of the covariances does not have to
be specified for this analysis. The estimator of the regression
parameter is:

$$\hat{\beta}_{SW} = \sum_{h=1}^{H} \sum_{i \in s_h} A^{-1} X_{hi}^T W_{hi} Y_{hi}, \qquad (10)$$

where $X_{hi}$ is the $m_{hi} \times p$ matrix of covariates for sample
units in cluster $hi$, $W_{hi} = \mathrm{diag}(w_t)$, $t \in s_{hi}$, is the diagonal
matrix of survey weights for cluster $hi$ and $Y_{hi}$ is the
$m_{hi} \times 1$ vector of response variables in cluster $hi$. The
model variance of $\hat{\beta}_{SW}$ is:

$$\mathrm{Var}_M(\hat{\beta}_{SW}) = A^{-1} \left[ \sum_{h=1}^{H} \sum_{i \in s_h} X_{hi}^T W_{hi} V_{hi} W_{hi} X_{hi} \right] A^{-1} = A^{-1} \left[ \sum_{h=1}^{H} X_h^T W_h V_h W_h X_h \right] A^{-1}, \qquad (11)$$

where $V_{hi} = \mathrm{Var}_M(Y_{hi})$ and $V_h = \mathrm{Blkdiag}(V_{hi})$, $i \in s_h$.
Expression (11) is a special case of (2) with
$X^T = (X_1^T, X_2^T, \ldots, X_H^T)$, $X_h$ is the $m_h \times p$ matrix of covariates for
sample units in stratum $h$, $W = \mathrm{diag}(W_{hi})$, for $h = 1, \ldots, H$
and $i \in s_h$, and $V = \mathrm{Blkdiag}(V_h)$.


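Denote the cluster-level residuals as a vector,
$e_{hi} = Y_{hi} - X_{hi} \hat{\beta}_{SW}$. A design-based linearization estimator is:

$$\mathrm{var}_L(\hat{\beta}_{SW}) = A^{-1} \left[ \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{i \in s_h} (z_{hi} - \bar{z}_h)(z_{hi} - \bar{z}_h)^T \right] A^{-1}$$
$$= A^{-1} \left[ \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \left( \sum_{i \in s_h} z_{hi} z_{hi}^T - n_h \bar{z}_h \bar{z}_h^T \right) \right] A^{-1}, \qquad (12)$$

where

$$\bar{z}_h = \frac{1}{n_h} \sum_{i \in s_h} z_{hi}$$

and $z_{hi} = X_{hi}^T W_{hi} e_{hi}$ with $e_{hi} = Y_{hi} - X_{hi} \hat{\beta}_{SW}$. This
expression can be reduced to the formula for a single-stage
stratified design when the cluster sample sizes are all
equal to 1, $m_{hi} = 1$. Expression (12) is used by the Stata
and SUDAAN packages, among others. The estimator
$\mathrm{var}_L(\hat{\beta}_{SW})$ is consistent and approximately design-unbiased
under a design where clusters are selected with replacement
(Fuller 2002). Li (2007a, b) showed that (12) is also an
approximately model-unbiased estimator under model (11).
The term in brackets in (12) serves as an estimator of the
matrix $B$ in expression (2). The components of $\mathrm{var}_L(\hat{\beta}_{SW})$
can be used to construct estimators of $\zeta_k$ and $\varrho_k$ in (5) and
$\varrho_{mk}$ in (8). In particular,

$$\hat{\zeta}_k = \frac{e_{xk}^T W \hat{V} W e_{xk}}{e_{xk}^T W e_{xk}}, \qquad (13)$$

where

$$\hat{V} = \mathrm{Blkdiag} \left\{ \frac{n_h}{n_h - 1} \left( \hat{V}_h - \frac{1}{n_h} e_h e_h^T \right) \right\}, \quad h = 1, \ldots, H,$$

with $\hat{V}_h = \mathrm{Blkdiag}(e_{hi} e_{hi}^T)$, $i \in s_h$, $e_h$ the stacked vector of
residuals $e_{hi}$ in stratum $h$, and
$e_{xk} = x_k - X_{(k)} \hat{\beta}_{SW(k)}$. The estimate of $\varrho_{mk}$, defined
following (8), is

$$\hat{\varrho}_{mk} = \frac{x_k^T W x_k - \hat{N} \bar{\tilde{x}}_k^2}{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T \hat{\tilde{V}} (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}, \qquad (14)$$

where $\hat{\tilde{V}} = W^{1/2} \hat{V} W^{1/2}$. The estimate $\hat{\varrho}_k$ is obtained
from $\varrho_k$ in the same way, by replacing $V$ with $\hat{V}$.

Given these component estimators, $VIF_k$ is estimated by

$$\widehat{VIF}_k = \frac{\hat{\zeta}_k \hat{\varrho}_k}{1 - R^2_{SW(k)}}$$

and $VIF_{mk}$ is estimated by

$$\widehat{VIF}_{mk} = \frac{\hat{\zeta}_k \hat{\varrho}_{mk}}{1 - R^2_{SWm(k)}}.$$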

4. Experimental study 


We will now illustrate the proposed, modified collinearity 
diagnostics and investigate their behavior using dietary 
intake data from the National Health and Nutrition Exami- 
nation Survey (NHANES) 2007-2008 (http://www.cdc.gov/ 
nchs/nhanes/nhanes2007-2008/datadoc_changes_0708.htm). 
The dietary intake data are used to estimate the types and 
amounts of foods and beverages consumed during the 24- 
hour period prior to the interview (midnight to midnight), 
and to estimate intakes of energy, nutrients, and other food 
components from those foods and beverages. NHANES 
uses a complex, multistage, probability sampling design. 
Oversampling of certain population subgroups is done to 
increase the reliability and precision of health status 
indicator estimates for these groups. Among the respondents 
who received the in-person interview in the mobile examination
center (MEC), around 94% provided complete dietary intakes.
The survey weights in these data were constructed
by taking MEC sample weights and further
adjusting for the additional nonresponse and the differential 
allocation by day of the week for the dietary intake data 
collection. These weights are more variable than the MEC 
weights. The data set used in our study is a subset of 2007- 
2008 data composed of female respondents aged 26 to 40. 
Observations with missing values in the selected variables 
are excluded from the sample which finally contains 672 
complete respondents. The final weights in our sample 
range from 6,028 to 330,067, with a ratio of 55:1. The U.S. 
National Center for Health Statistics recommends that the 
design of the sample be approximated by the stratified
selection with replacement of 32 PSUs from 16 strata, with 
2 PSUs within each stratum. 

For this empirical study, a linear regression of body
weight (kg) is fitted using survey weighted least squares. The
predictor variables considered include age, Black (race) and
nine daily total nutrition intake variables, which are
calorie (100kcal), protein (100gm), carbohydrate (100gm),
sugar (100gm), total fat (100gm), total saturated fatty
acids (100gm), total monounsaturated fatty acids (100gm),
total polyunsaturated fatty acids (100gm) and alcohol (100gm).
All the daily total nutrition intake variables
are correlated with each other to different degrees as
shown in Figure 1.

Three regression methods are compared in this study. The
first one uses the ordinary least squares (OLS) method and
ignores sampling complexities including the weighting. The
second one uses weighted least squares (WLS), which
incorporates the survey weights by assuming $V = W^{-1}$ but
ignores all other sampling complexities. The third one is survey
weighted least squares (SWLS), which uses the actual
complex sampling design as described in section 3.4. The
weight matrices, coefficient variance estimators and
collinearity diagnostics of these three methods are listed
in Table 1.

The results from fitting the model using three different 
regression methods are displayed in Table 2. The model 
with all the predictors is shown in the upper part of the table. 
In the lower tier of the table, a reduced model with less of 
the near-dependency problem is fitted with only three 
predictors: age, Black and calorie. In the reduced model, the 
value of the coefficient for calorie is positive and significant 
when WLS or SWLS is used, which seems logical and 
reflects the anticipated positive relationship between a 
respondent’s body weight and her daily total calorie intake. 
However, when the other total nutrition intake variables are 
included in the model, the value of the calorie coefficient is 
negative and not significant due to its inflated variance. This 
is a typical example in which the variance of a coefficient is 
inflated, and its sign is illogical due to collinearity. 


Table 3 reports the VIF values when the three different 
regression methods are used. The VIF formulas for these 
regression methods are listed in Table 1. When all the 
predictors are included in the model, calorie has the largest 
VIF values in all the regressions due to its high near- 
dependency with all the other total nutrition intake variables. 
As shown in Table 1, the VIF in SWLS can be obtained by
multiplying the VIF from WLS by the adjustment
coefficient $\hat{\zeta}_k \hat{\varrho}_{mk}$. In Table 3, the adjustment coefficients
$\hat{\zeta}_k \hat{\varrho}_{mk}$ for all the non-fat total nutrition intake variables are
less than 1, especially the one for carbohydrate, which is
0.46. This indicates that the VIF values for these variables in
SWLS are much smaller than the ones in WLS and the
collinearity among predictors in the model has less impact
on the coefficient estimation when using SWLS, compared
to using WLS. But for the fat-related nutrition intake
variables, the $\hat{\zeta}_k \hat{\varrho}_{mk}$ are all larger than 1. Thus, the
collinearity among the fat-related nutrition intake variables
is more harmful to the coefficient estimation in SWLS than
in WLS. To take a closer look at this problem, we also fitted
a model that only contains two nutrition intake variables:
total fat and total monounsaturated fatty acids. The SWLS
VIF values are three times as large as the ones from OLS or
WLS for these two nutrition variables. If an analyst is
analyzing this survey data using SWLS but uses the 
unadjusted VIF values provided by standard statistical 
packages for either OLS (as shown in the first column) or 
WLS (as shown in the second column), the unadjusted VIFs 
will give somewhat misleading judgements on the severity 
of collinearity in this model. In summary, although the 
estimated slopes and predictions in regression using WLS 
and SWLS are the same, the VIFs can be underestimated or 
overestimated if survey complexities are ignored. 
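The multiplicative relation between the WLS and SWLS VIFs can be checked by hand from the legible full-model rows of Table 3. The snippet below is only an arithmetic check; the numbers are copied from the table as printed (rounded to two decimals), so small discrepancies from rounding are expected.

```python
# variable: (VIF from WLS, adjustment coefficient, VIF from SWLS), full model, Table 3
rows = {
    "Protein": (127.35, 0.81, 103.50),
    "Carbohydrate": (1007.40, 0.46, 462.08),
    "Sugar": (7.03, 0.69, 4.87),
    "Alcohol": (115.67, 0.78, 89.92),
    "Total Saturated Fatty Acids": (112.61, 1.80, 202.91),
    "Total Monounsaturated Fatty Acids": (107.34, 2.67, 286.24),
}

# Relative gap between VIF_WLS * adjustment and the reported VIF_SWLS
rel_err = {
    name: abs(wls * adj - swls) / swls
    for name, (wls, adj, swls) in rows.items()
}

for name, err in rel_err.items():
    print(f"{name}: relative difference {err:.4f}")
```

Every product agrees with the reported SWLS value to within rounding of the two-decimal adjustment coefficient.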


Table 1
Regression methods and their collinearity diagnostic statistics used in this experimental study

Regression type   Weight matrix W (a)   Variance estimator of $\hat{\beta}$                         VIF formula
OLS               $I$                   $\hat{\sigma}^2 (X^T X)^{-1}$                              $1 / (1 - R^2_{m(k)})$ (b)
WLS               $W$ (c)               $\hat{\sigma}^2 (X^T W X)^{-1}$                            $1 / (1 - R^2_{SWm(k)})$
SWLS              $W$ (c)               $(X^T W X)^{-1} X^T W \hat{V} W X (X^T W X)^{-1}$          $\hat{\zeta}_k \hat{\varrho}_{mk} / (1 - R^2_{SWm(k)})$

with

$$\hat{V} = \mathrm{Blkdiag} \left\{ \frac{n_h}{n_h - 1} \left[ \mathrm{Blkdiag}(e_{hi} e_{hi}^T) - \frac{1}{n_h} e_h e_h^T \right] \right\}, \quad h = 1, \ldots, H,$$

$$\hat{\zeta}_k = \frac{e_{xk}^T W \hat{V} W e_{xk}}{e_{xk}^T W e_{xk}}, \qquad \hat{\varrho}_{mk} = \frac{x_k^T W x_k - \hat{N} \bar{\tilde{x}}_k^2}{(\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)^T \hat{\tilde{V}} (\tilde{x}_k - \tilde{1} \bar{\tilde{x}}_k)}.$$

(a) In all the regression models, the parameters are estimated by: $\hat{\beta} = (X^T W X)^{-1} X^T W Y$.
(b) $R^2_{m(k)}$ is the OLS R-square from regressing $x_k$ on the $x$'s in the remainder of $X$ (excluding a column for the intercept).
(c) $W$ is the diagonal matrix with survey weights $w_i$ on the main diagonal.


Figure 1 Pairwise scatterplots and correlation coefficients of nutrition variables*
* T.Fat: total fat; T.S.Fat: total saturated fatty acid; T.M.Fat: total monounsaturated fatty acid; T.P.Fat: total polyunsaturated fatty acid.


Table 2
Parameter estimates with their associated standard errors using three different regression methods (a)

Full Model
                                     OLS                      WLS                      SWLS
Variable                             Beta         S.E.        Beta         S.E.        Beta         S.E.
Intercept                            63.90***     6.95        67.41***     6.36        67.41***     8.76
Age                                  0.26         0.19        0.08         0.18        0.08         0.25
Black                                [illegible]  2.07        [illegible]  2.38        [illegible]  [illegible]
Calorie                              -6.41        5.76        -8.19        5.56        -8.19        [illegible]
Protein                              25.92        24.76       40.98        23.60       40.98        [illegible]
Carbohydrate                         26.67        [illegible] 32.31        22.96       32.31        22.65
Sugar                                -1.90        3.06        -0.30        2.82        -0.30        4.06
Fiber                                -41.17       20.23       -34.20       17.98       -34.20       19.05
Alcohol                              38.84        39.45       49.37        38.28       49.37        40.10
Total Fat                            150.25*      69.53       161.78*      [illegible] 161.78       94.76
Total Saturated Fatty Acids          -113.20*     49.81       -101.40      56.26       -101.40      82.71
Total Monounsaturated Fatty Acids    -72.05       48.03       -92.44       [illegible] -92.44       83.51
Total Polyunsaturated Fatty Acids    -92.60*      46.13       -75.55       [illegible] -75.55       78.76

Reduced Model
                                     OLS                      WLS                      SWLS
Variable                             Beta         S.E.        Beta         S.E.        Beta         S.E.
Intercept                            [illegible]  6.88        [illegible]  6.29        [illegible]  8.48
Age                                  0.27         0.19        0.07         0.18        0.07         0.25
Black                                12.54***     1.98        [illegible]  [illegible] [illegible]  2.05
Calorie                              0.15         0.10        [illegible]  0.09        0.23*        0.10

(a) p values of significance: * p = 0.05; ** p = 0.01; *** p = 0.005.


Table 3
VIF values using three different regression methods

Full Model
                                     OLS          WLS          SWLS
Variable                             VIF          VIF          VIF          $\hat{\zeta}_k \hat{\varrho}_{mk}$
Age                                  1.02         1.03         0.96         0.94
Black                                1.10         1.07         [illegible]  1.05
Calorie                              3,411.61     3,562.70     2,740.83     0.77
Protein                              123.12       127.35       103.50       0.81
Carbohydrate                         1,074.87     1,007.40     462.08       0.46
Sugar                                [illegible]  7.03         4.87         0.69
Fiber                                4.59         3.94         [illegible]  0.60
Alcohol                              120.56       115.67       89.92        0.78
Total Fat                            1,190.24     1,475.27     [illegible]  1.70
Total Saturated Fatty Acids          76.80        112.61       202.91       1.80
Total Monounsaturated Fatty Acids    [illegible]  107.34       286.24       2.67
Total Polyunsaturated Fatty Acids    34.73        49.45        118.21       2.39

Reduced Model
                                     OLS          WLS          SWLS
Variable                             VIF          VIF          VIF          $\hat{\zeta}_k \hat{\varrho}_{mk}$
Age                                  1.00         1.00         0.98         0.98
Black                                1.02         1.01         0.97         0.96
Total Fat                            20.10        20.22        63.15        3.12
Total Monounsaturated Fatty Acids    20.16        20.26        61.57        3.04

Reduced Model
                                     OLS          WLS          SWLS
Variable                             VIF          VIF          VIF          $\hat{\zeta}_k \hat{\varrho}_{mk}$
Age                                  1.00         1.00         0.98         0.97
Black                                1.00         1.03         1.00         1.00
Calorie                              1.00         1.01         0.96         0.95


5. Conclusion 


Regression diagnostics need to be adapted to be 
appropriate for models estimated from survey data to 
account for the use of weights and design features like 
stratification and clustering. In this paper we developed a 
new formulation for a variance inflation factor (VIF) 
appropriate for linear models. A VIF measures the amount 
by which the variance of a parameter estimator is inflated 
due to predictor variables being correlated with each other, 
rather than being orthogonal. Although survey-weighted 
regression slope estimates can be obtained from weighted 
least squares procedures in standard software packages, the 
VIFs produced by the non-survey routines are incorrect. The 
complex sample VIF is equal to the VIF from weighted 
least squares times an adjustment factor. The adjustment 
factor is positive but can be either larger or smaller than 1, 
depending on the nature of the data being analyzed. 

In an empirical study, we illustrated the application of 
our new approach using data from the 2007-2008 National 
Health and Nutrition Examination Survey. We provided a 
simple example of how the collinearity among predictors 
affects the estimation of coefficients in linear regression and
demonstrated that although the estimated coefficients (and
fitted values) are the same when weighted least squares or 
survey-weighted least squares are used, their estimated 
variances and VIF values (reflecting the impact of 
collinearity on coefficient estimation) can be different. 

The goals of an analysis must be considered in deciding 
how to use VIFs. If prediction is the main objective, then 
including collinear variables or selecting incorrect variables 
is less of a concern. If more substantive conclusions are 
desired, then an analyst should consider which variables 
should logically be included as predictors rather than relying 
on some automatic algorithm for variable selection. VIFs 
are a useful tool for identifying predictors whose estimated 
coefficients have variances that are unnecessarily large. 
Although VIFs might be considered as a tool for automatic 
variable selection, simulations in Liao (2010), not reported 
here, show that using VIFs is not a reliable way of 
identifying a true underlying model.


Appendix A


Derivation of $g^{kk}$


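Similar to the derivation of the conventional OLS VIF in
Theil (1971), the sum of squares and cross products matrix
$A = \tilde{X}^T \tilde{X}$ can be partitioned as

$$A = \begin{pmatrix} \tilde{x}_k^T \tilde{x}_k & \tilde{x}_k^T \tilde{X}_{(k)} \\ \tilde{X}_{(k)}^T \tilde{x}_k & \tilde{X}_{(k)}^T \tilde{X}_{(k)} \end{pmatrix}, \qquad (15)$$

where the columns of $\tilde{X}$ are reordered so that
$\tilde{X} = (\tilde{x}_k \; \tilde{X}_{(k)})$ with $\tilde{X}_{(k)}$ being the $n \times (p - 1)$ matrix
containing all columns except the $k$th column of $\tilde{X}$.

Using the formula for the inverse of a partitioned matrix,
the upper-left element of $A^{-1}$ can be expressed as:

$$a^{kk} = i_k^T A^{-1} i_k = i_k^T (\tilde{X}^T \tilde{X})^{-1} i_k = \frac{1}{(1 - R^2_{SW(k)}) SST_{SW(k)}} = \frac{1}{(1 - R^2_{SW(k)}) \tilde{x}_k^T \tilde{x}_k}, \qquad (16)$$

where

$$R^2_{SW(k)} = \frac{\hat{\beta}_{SW(k)}^T \tilde{X}_{(k)}^T \tilde{X}_{(k)} \hat{\beta}_{SW(k)}}{SST_{SW(k)}},$$

with $\hat{\beta}_{SW(k)} = (\tilde{X}_{(k)}^T \tilde{X}_{(k)})^{-1} \tilde{X}_{(k)}^T \tilde{x}_k$, is the coefficient of
determination corresponding to the regression of $\tilde{x}_k$ on the
$p - 1$ other explanatory variables. The term $SST_{SW(k)} = \tilde{x}_k^T \tilde{x}_k$
is the total sum of squares in this regression.

The term $(1 - R^2_{SW(k)})^{-1}$ in (16) is the VIF that will be
produced by standard statistical packages when a weighted
least squares regression is run. Under the model $Y = X\beta + e$
with $e \sim (0, \sigma^2 W^{-1})$, expression (16) is equal to
$\mathrm{Var}_M(\hat{\beta}_{SW_k}) / \sigma^2$. However, this is not appropriate for
survey-weighted least squares regressions because the variance of
$\hat{\beta}_{SW}$ has the more complex form in (2).

The matrix $G = A^{-1} B A^{-1}$ can be expressed as:

$$G = \begin{pmatrix} a^{kk} & a^{k(k)} \\ a^{(k)k} & A^{(k)(k)} \end{pmatrix} \begin{pmatrix} b_{kk} & b_{k(k)} \\ b_{(k)k} & B_{(k)(k)} \end{pmatrix} \begin{pmatrix} a^{kk} & a^{k(k)} \\ a^{(k)k} & A^{(k)(k)} \end{pmatrix}, \qquad (17)$$

where the inverse matrix is

$$A^{-1} = \begin{pmatrix} a^{kk} & a^{k(k)} \\ a^{(k)k} & A^{(k)(k)} \end{pmatrix},$$

$a^{k(k)}$ is defined as the $k$th row of $A^{-1}$ excluding $a^{kk}$,
$(a^{k1}, \ldots, a^{k(k-1)}, a^{k(k+1)}, \ldots, a^{kp})$, $a^{(k)k} = [a^{k(k)}]^T$, and $A^{(k)(k)}$
is defined as the $(p - 1) \times (p - 1)$ part of matrix $A^{-1}$
excluding the $k$th row and column. The partitioned version of
$B$ is

$$B = \begin{pmatrix} b_{kk} & b_{k(k)} \\ b_{(k)k} & B_{(k)(k)} \end{pmatrix} = \begin{pmatrix} \tilde{x}_k^T \tilde{V} \tilde{x}_k & \tilde{x}_k^T \tilde{V} \tilde{X}_{(k)} \\ \tilde{X}_{(k)}^T \tilde{V} \tilde{x}_k & \tilde{X}_{(k)}^T \tilde{V} \tilde{X}_{(k)} \end{pmatrix}. \qquad (18)$$

By virtue of the symmetry of $A$ and $B$, the $k$th
diagonal element of $G$ is

$$g^{kk} = (a^{kk})^2 b_{kk} + 2 a^{kk} b_{k(k)} a^{(k)k} + a^{k(k)} B_{(k)(k)} a^{(k)k}. \qquad (19)$$

Using the partitioned inverse of matrix $A$, which
represents $(\tilde{X}^T \tilde{X})^{-1}$, it can be shown that

$$a^{(k)k} = -a^{kk} (\tilde{X}_{(k)}^T \tilde{X}_{(k)})^{-1} \tilde{X}_{(k)}^T \tilde{x}_k = -a^{kk} \hat{\beta}_{SW(k)}. \qquad (20)$$

Substituting $a^{(k)k}$ in (19), $g^{kk}$ can be compactly expressed
in terms of $a^{kk}$, $\hat{\beta}_{SW(k)}$ and the lower right component
of matrix $B$:

$$g^{kk} = (a^{kk})^2 \left( b_{kk} - 2 b_{k(k)} \hat{\beta}_{SW(k)} + \hat{\beta}_{SW(k)}^T B_{(k)(k)} \hat{\beta}_{SW(k)} \right)$$
$$= \left[ \frac{1}{(1 - R^2_{SW(k)}) \tilde{x}_k^T \tilde{x}_k} \right]^2 \left( \tilde{x}_k^T \tilde{V} \tilde{x}_k - 2 \tilde{x}_k^T \tilde{V} \tilde{X}_{(k)} \hat{\beta}_{SW(k)} + \hat{\beta}_{SW(k)}^T \tilde{X}_{(k)}^T \tilde{V} \tilde{X}_{(k)} \hat{\beta}_{SW(k)} \right)$$
$$= \left[ \frac{1}{(1 - R^2_{SW(k)}) \tilde{x}_k^T \tilde{x}_k} \right]^2 (\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})^T \tilde{V} (\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})$$
$$= \frac{1}{(1 - R^2_{SW(k)}) \tilde{x}_k^T \tilde{x}_k} \cdot \frac{(\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})^T \tilde{V} (\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})}{(\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})^T (\tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)})}$$
$$= \frac{1}{1 - R^2_{SW(k)}} \cdot \frac{\tilde{e}_{xk}^T \tilde{V} \tilde{e}_{xk}}{\tilde{e}_{xk}^T \tilde{e}_{xk}} \cdot \frac{1}{\tilde{x}_k^T \tilde{x}_k} = \frac{\zeta_k \varrho_k}{1 - R^2_{SW(k)}} \cdot \frac{\tilde{x}_k^T \tilde{V} \tilde{x}_k}{(\tilde{x}_k^T \tilde{x}_k)^2}, \qquad (21)$$

where $\tilde{e}_{xk} = \tilde{x}_k - \tilde{X}_{(k)} \hat{\beta}_{SW(k)}$ is the residual from regressing
$\tilde{x}_k$ on $\tilde{X}_{(k)}$, and the fourth line uses
$(1 - R^2_{SW(k)}) \tilde{x}_k^T \tilde{x}_k = \tilde{e}_{xk}^T \tilde{e}_{xk}$.


Estimating agreement coefficients from sample survey data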


Hung-Mo Lin, Hae-Young Kim, John M. Williamson and Virginia M. Lesser


Abstract 


We present a generalized estimating equations approach for estimating the concordance correlation coefficient and the 
kappa coefficient from sample survey data. The estimates and their accompanying standard error need to correctly account 
for the sampling design. Weighted measures of the concordance correlation coefficient and the kappa coefficient, along with 
the variance of these measures accounting for the sampling design, are presented. We use the Taylor series linearization 
method and the jackknife procedure for estimating the standard errors of the resulting parameter estimates. Body 
measurement and oral health data from the Third National Health and Nutrition Examination Survey are used to illustrate 


this methodology. 


Key Words: Clustering; Concordance correlation coefficient; Generalized estimating equations; Jackknife estimator; 
Kappa coefficient; Sample weighting; Stratification; Taylor series linearization. 


1. Introduction 


Surveys often collect multiple measures of latent condi- 
tions such as quality of life and aspiration for a college 
education, as well as multiple measures of difficult-to-classify
conditions such as having chronic fatigue syndrome.
When multiple measures are collected, interest
naturally focuses on the agreement between the multiple 
measures and in obtaining confidence intervals on those 
agreement measures. Also, there may be interest in con- 
trasting agreement across population subgroups and across 
alternate pairings of measurements. In this context, one 
might be interested in testing equality of agreement 
measures. This paper focuses on two measures of agreement 
between such multiple measures, the concordance correlation
coefficient (CCC, $\rho_c$) and the kappa ($\kappa$) coefficient.
The former is useful for continuous measurements with 
natural scales. If a measure of a latent concept has no natural 
scale, then it can be arbitrarily rescaled to have mean zero 
and unit variance. When this is possible, it is meaningless to 
talk about differences in marginal moments. However, if 
there is a natural scale, then rescaling is not desirable and a 
good measure of agreement will take into account both 
correlation and agreement of marginal moments. The kappa 
coefficient is most useful for binary classifications. 

The CCC has been shown to be more appropriate for 
measuring agreement or reproducibility (Lin 1989; Lin 
1992) than the Pearson correlation coefficient ($\rho$). It evaluates
the accuracy between two readings by measuring the
variation of the fitted linear relationship from the 45° line 
through the origin (the concordance line) and precision by 
measuring how far each observation deviates from the fitted 


line. Let $Y_{i1}$ and $Y_{i2}$ denote a pair of continuous random
variables measured on the same subject $i$ using two methods.
The CCC for measuring the agreement of $Y_{i1}$ and $Y_{i2}$
is defined as follows:

$$\rho_c = 1 - \frac{E[(Y_{i1} - Y_{i2})^2]}{E_{\mathrm{indep}}[(Y_{i1} - Y_{i2})^2]} = \frac{2 \sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2},$$

where $\mu_1 = E(Y_{i1})$, $\mu_2 = E(Y_{i2})$, $\sigma_1^2 = \mathrm{var}(Y_{i1})$,
$\sigma_2^2 = \mathrm{var}(Y_{i2})$, and $\sigma_{12} = \mathrm{cov}(Y_{i1}, Y_{i2})$
(Lin 1989). As noted by Lin (1989), $\rho_c = 0$ if and
only if $\rho = 0$. It can also be shown algebraically that $\rho_c$ is
proportional to $\rho$ and that $-1 \le -|\rho| \le \rho_c \le |\rho| \le 1$ (Lin
1989). Hence imprecision can be reflected by a smaller $\rho$
and systematic bias can be reflected by a smaller ratio of
$\rho_c$ relative to $\rho$. Together, information on $\rho$ and $\rho_c$
provides a set of tools to identify which corrective action,
either to improve accuracy or to improve precision, is
most beneficial (Lin and Chinchilli 1997).
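The definition can be illustrated with a plug-in calculation. The sketch below uses unweighted sample moments with divisor $n$, which is an assumption made for brevity and is not the survey-weighted estimator developed in this paper; the function name `ccc` and the toy readings are hypothetical.

```python
def ccc(y1, y2):
    """Plug-in concordance correlation coefficient from unweighted sample moments."""
    n = len(y1)
    m1 = sum(y1) / n
    m2 = sum(y2) / n
    s1 = sum((a - m1) ** 2 for a in y1) / n                     # variance of method 1
    s2 = sum((b - m2) ** 2 for b in y2) / n                     # variance of method 2
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n  # covariance
    return 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)

# Method 2 reads exactly one unit higher than method 1 on every subject
method1 = [1.0, 2.0, 3.0, 4.0, 5.0]
method2 = [2.0, 3.0, 4.0, 5.0, 6.0]

rho_c = ccc(method1, method2)
print(rho_c)
```

Here the Pearson correlation is 1 because the readings are perfectly linear, yet the constant shift between methods pulls $\rho_c$ below 1, which is the systematic-bias penalty described above.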

The intraclass correlation coefficient (ICC) is also a 
popular measure of agreement for variables measured on a 
continuous scale (Fleiss 1986). Suppose $Y_{i1}$ and $Y_{i2}$ can be
described in a linear model as follows: $Y_{ij} = \mu_j + \delta_i + e_{ij}$,
where $\mu_j$ is the mean of the measurement from the
$j$th method, $\delta_i \sim (0, \sigma_\delta^2)$ is the latent variable for the $i$th
subject, and the $e_{ij} \sim (0, \sigma_e^2)$ are independent error terms.
Carrasco and Jover (2003, page 850) used a model with
variance components to demonstrate that the CCC is the
intraclass correlation coefficient (ICC) when one takes into
account the difference in averages of the methods:
Therefore, one can estimate the CCC using the variance
components of a mixed effects model or the common 
method of moments. Because of its superiority to the 
Pearson correlation coefficient and its link to the ICC, 
application of the CCC has gained popularity in recent years 
(Chinchilli, Martel, Kumanyika and Lloyd 1996; Zar 1996). 
In 2009 and 2010, the CCC was used as a measure of
agreement in more than 60 medical publications in areas 
such as respiratory illness (Dixon, Sugar, Zinreich, Slavin, 
Corren, Naclerio, Ishii, Cohen, Brown, Wise and Irvin 
2009; Kocks, Kerstjens, Snijders, de Vos, Biermann, 
van Hengel, Strijbos, Bosveld and van der Molen 2010), 
sleep (Khawaja, Olson, van der Walt, Bukartyk, Somers, 
Dierkhising and Morgenthaler 2010), pediatrics (Liottol, 
Radaelli, Orsil, Taricco, Roggerol, Giann, Consonni, 
Moscal and Cetin 2010), neurology (MacDougall, Weber, 
McGarvie, Halmagyi and Curthoys 2009), and radiology 
(Mazaheri, Hricak, Fine, Akin, Shukla-Dave, Ishill, 
Moskowitz, Grater, Reuter, Zakian, Touijer and Koutcher 
2009). 

The kappa coefficient ($\kappa$) (Cohen 1960) and the
weighted kappa coefficient (Cohen 1968) are the most
popular indices for measuring agreement for discrete and
ordinal outcomes, respectively (Fleiss 1981). Let $Y_{i1}$ and
$Y_{i2}$ denote two binary random variables taking values 0 and
1 with probabilities denoted by $\pi_1 = \Pr(Y_{i1} = 1)$ and
$\pi_2 = \Pr(Y_{i2} = 1)$. Kappa corrects the percentage of agreement
between raters by taking into account the proportion of
agreement expected by chance (calculated under independence),
and is defined as follows:

$$\kappa = \frac{P_o - P_e}{1 - P_e},$$

where $P_e$ is the probability that the pair of binary responses
are equal assuming independence, $\pi_1 \pi_2 + (1 - \pi_1)(1 - \pi_2)$,
and $P_o$ is the probability that the pair are equal (Cohen
1960). The difference $P_o - P_e$ is the excess of agreement
over chance agreement. A value of 0 for $\kappa$ indicates no
agreement beyond chance and a value of 1 indicates perfect
agreement (Fleiss 1981). Disadvantages of kappa are that it is
a function of the marginal distribution of the raters (Fleiss,
Nee and Landis 1979; Tanner and Young 1985) and its
range depends on the number of ratings per subject (Fleiss
et al. 1979). Robieson (1999) noted that the CCC computed
from ordinal scaled data is equivalent to the weighted kappa
when integer scores are used. Kappa has been used to
measure the validity and reproducibility of the similarity
between twins (Klar, Lipsitz and Ibrahim 2000), different
epidemiologic tools (Maclure and Willett 1987), and
control-informant agreement from case-control studies
(Korten, Jorm, Henderson, McCusker and Creasey 1992).
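For binary ratings the definition reduces to a few proportions. The following sketch uses unweighted sample proportions, an assumption made only for illustration (it ignores sampling weights and design, which are the subject of this paper); the function name `kappa` and the toy ratings are hypothetical.

```python
def kappa(r1, r2):
    """Plug-in kappa for two binary (0/1) ratings of the same subjects."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    pi1 = sum(r1) / n                               # Pr(rating 1 = 1)
    pi2 = sum(r2) / n                               # Pr(rating 2 = 1)
    p_e = pi1 * pi2 + (1 - pi1) * (1 - pi2)         # chance agreement under independence
    return (p_o - p_e) / (1 - p_e)

rater1 = [1, 1, 1, 0, 0, 0, 1, 0]
rater2 = [1, 1, 0, 0, 0, 1, 1, 0]

k = kappa(rater1, rater2)
print(k)
```

In this toy example the raters agree on 6 of 8 subjects while chance agreement is 0.5, so only half of the possible agreement beyond chance is achieved.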
The value of sample surveys has been well recognized
and estimation for data collected from sample surveys has 
been widely documented (Hansen, Hurwitz and Madow 
1953; Cochran 1963; Kish 1965). For example, a number of 
federal studies conducted in the U.S. to obtain estimates of 
the health of the population are based on national surveys, 
such as the National Health Interview Survey (NHIS), the 
Behavioral Risk Factor Surveillance System (BRFSS), and 
the National Health and Nutrition Examination Surveys 
(NHANES). Each of these studies incorporates complex 
survey design structure, namely oversampling of subpopula- 
tions, stratification and clustering. These designs are often 
used to improve precision, provide estimates for subpopula- 
tions, or reduce costs associated with frame development. In 
order to draw design-based inference to the targeted 
population for complex survey designs, estimators and their 
variances include sampling weights and account for the 
design structure to obtain unbiased estimates. In addition, by 
including the sampling weights and incorporating the 
sample design in analyses, any potential correlation from the 
clusters in a multistage design is taken into account so that 
the standard errors of the estimators are not underestimated. 

Often researchers are not interested in testing whether 
their estimation of agreement using either the CCC or kappa 
is significantly different from zero. Their interest is to report 
the confidence intervals along with their estimates (e.g., 
Dixon et al. 2009; Mazaheri et al. 2009). Similar to the 
Pearson correlation coefficient, there is no target value that 
can be used to judge if agreement is strong. Therefore, it is 
essential that judgment of agreement between any test and 
reference methods should be made with an established 
degree of certainty. In some situations, studies are con- 
ducted that require hypothesis testing or comparisons of 
agreement indexes for more than one new method against a 
reference method. For example, Khawaja et al. (2010) 
tested the equality of two CCCs that compared the apnea 
hypopnea index (AHI) from the first 2 and 3 hours of sleep 
with the gold standard AHI from FN-PSG (FN-AHI). In 
radiology research, associations between volume measure- 
ments of prostate tumor from imaging and also from 
pathologic examination were assessed by comparing CCCs. 
The two imaging methods were tested for equality of 
agreement with the pathologic results (Mazaheri et al. 
2009). Tests of equal kappa have been used to compare 
visual assessment and computerized planimetry in assessing 
cervical ectopy (Gilmour, Ellerbrock, Koulos, Chiasson, 
Williamson, Kuhn and Wright 1997; Williamson, 
Manatunga and Lipsitz 2000), and in comparing mono- 
zygotic and dizygotic twins in terms of cholesterol levels 
(Feinleib, Garrison, Fabsitz, Christian, Hrubec, Borhani, 
Kannel, Roseman, Schwartz and Wagner 1977). 

As illustrated in the two NHANES III examples in 
Section 3, large differences can exist between the weighted 
and unweighted estimates of parameter estimate standard 
errors in survey studies. Failure to include sampling weights 
and take into account the sample design in analyses will 
result in underestimation of standard errors and incorrect 
inference. This is especially important for surveys repeated 
every few years, and researchers often have a special 
interest in comparing changes among domains or sub- 
populations. For instance, in the first NHANES III applica- 
tion, we compare the agreement between self reported and 
measured body weights at examination in adolescents. 
Computing accurate standard errors (and hence confidence intervals) 
is necessary if the interest is in comparing the CCC across 
domains, such as normal weight and obese subgroups. 

We provide weighted measures of the CCC and kappa 
coefficient, along with the variance estimators of these 
measures accounting for the sampling design. In Section 2, 
we present a generalized estimating equations approach for 
estimating these two agreement coefficients from sample 
survey data. In Section 3, we illustrate our method with data 
collected from the NHANES III study. We use body 
measurement data to estimate $\rho_c$ for assessing the agreement 
between self-reported and actual weight. We also use 
oral health data to estimate $\kappa$ for assessing the agreement 
between two definitions of periodontal disease. We account 
for stratification and clustering, and incorporate weights of 
the survey design in both examples. We conclude with a 
short discussion. 


2. Methods 


We propose a general approach for estimating the CCC 
and kappa from sample survey data using two GEE 
approaches. For the CCC, three sets of estimating equations 
are required. A first set of estimating equations models the 
distribution of the continuous responses. Following Barnhart 
and Williamson (2001), a second set of estimating equations 
is used to estimate the variances of the continuous re- 
sponses. A third set of estimating equations estimates the 
CCC by modeling the covariance between the paired 
continuous responses and the estimates of the means and 
variances from the first two sets of estimating equations. For 
$\kappa$, only two sets of estimating equations are required. A 
first set of estimating equations models the marginal 
distribution of the binary responses. Following Lipsitz, 
Laird and Brennan (1994), a second set of estimating 
equations is introduced to estimate $\kappa$ by modeling a binary 
random variable depicting agreement between two re- 
sponses on a subject. 

In order to account for variable selection probabilities, 
weight matrices are incorporated into each set of estimating 
equations. Standard errors of the proposed estimators $\hat{\rho}_c$ and 
$\hat{\kappa}$ from sample survey data are estimated with the Taylor 
series linearization method. We also show how standard 
error estimation of the proposed estimators can be accomplished 
by using the jackknife approach. 

Assume a sample survey is conducted with stratification, 
clustering, and unequal probabilities of selection. Let $Y_{hij}$ 
denote the response variable for the $j$th member 
($j = 1, \ldots, m_{hi}$) of the $i$th cluster ($i = 1, \ldots, n_h$) of the 
$h$th stratum ($h = 1, \ldots, H$). Averaging over all possible samples, 
the corresponding expected value is $E[Y_{hij}] = \mu_{hij}$ if $Y_{hij}$ is 
a continuous response, and the corresponding probability is 
$E[Y_{hij}] = \Pr[Y_{hij} = 1] = \pi_{hij}$ if $Y_{hij}$ is a binary response. 
The sampling weight $w_{hij}$ is the inverse of the probability of 
selection for the $j$th member of the $i$th cluster of the $h$th 
stratum. 



2.1 The concordance correlation coefficient 


Liang and Zeger (1986) developed moment-based 
methods for analyzing correlated observations from the 
same cluster (e.g., repeated measurements over time on the 
same individual or observations on multiple members of the 
same family). The GEE approach results in consistent 
marginal parameter estimation, even with misspecification 
of the correlation structure by using a robust “sandwich” 
estimator of variance. We use the GEE approach to analyze 
sample survey data by additionally incorporating a sampling 
weight matrix as follows: 


\[
\sum_{h=1}^{H} \sum_{i=1}^{n_h} D_{hi}' W_{hi} V_{hi}^{-1} (Y_{hi} - \mu_{hi}) = 0,
\]

where $D_{hi}'$ is the $(q \times m_{hi})$ derivative matrix $d[\mu_{hi}]'/d\beta$, 
$W_{hi}$ is a $(m_{hi} \times m_{hi})$ main diagonal matrix consisting of the 
person-specific sampling weights $w_{hij}$, $V_{hi}$ is a $(m_{hi} \times m_{hi})$ 
working variance-covariance matrix for the within-cluster 
responses, $Y_{hi}$ is a $(m_{hi} \times 1)$ response vector consisting of 
the responses $Y_{hij}$, and $\mu_{hi} = E[Y_{hi}]$ is possibly a function 
of the $(q \times 1)$ parameter vector $\beta$. The GEE can then be 
solved non-iteratively, resulting in the usual estimate 

\[
\hat{\mu} = \frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} Y_{hij}}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}}
\]

if we are estimating a common mean $\mu = \beta$ ($q = 1$) and 
are using an independence working covariance matrix. 
Assume a pair of continuous responses are observed for 
the $j$th member of the $i$th cluster of the $h$th stratum, $Y_{hij1}$ 
and $Y_{hij2}$, and their expected values are $\mu_{hij1}$ and $\mu_{hij2}$. 
Again, assume we are estimating common means $\mu_1$ and 
$\mu_2$ without covariates for the pair of within-subject 
continuous responses, which can be estimated by using the 
above generalized estimating equation. 

Barnhart and Williamson (2001) demonstrated how three 
sets of generalized estimating equations can be used to 
model the CCC defined in (1) using correlated data. We 
extend Barnhart and Williamson’s (2001) second set of 
estimating equations to estimate the variances of the continuous 
responses by again incorporating a weight matrix as 
follows: 

\[
\sum_{h=1}^{H} \sum_{i=1}^{n_h} F_{hi}' W_{hi} H_{hi}^{-1} (Y_{hi}^2 - \delta_{hi}^2) = 0,
\]

where $F_{hi}'$ is the $(2 \times 2m_{hi})$ derivative matrix $d[\delta_{hi}^2]'/d\sigma^2$ 
with $\sigma^2 = [\sigma_1^2, \sigma_2^2]'$, $W_{hi}$ is a $(2m_{hi} \times 2m_{hi})$ main diagonal 
matrix consisting of the person-specific sampling weights 
$w_{hij}$, $H_{hi}$ is a $(2m_{hi} \times 2m_{hi})$ working variance-covariance 
matrix for the within-cluster squared responses, 
$Y_{hi}^2 = [Y_{hi11}^2, Y_{hi12}^2, \ldots, Y_{him_{hi}1}^2, Y_{him_{hi}2}^2]'$ is a $(2m_{hi} \times 1)$ 
response vector of the squared continuous variables, and 
$\delta_{hi}^2 = E[Y_{hi}^2]$. Although $\delta_{hi}^2$ is a function of both the variance 
terms $\sigma_1^2$ and $\sigma_2^2$ and the means $\mu_1$ and $\mu_2$, it is assumed 
that the means are fixed in $\delta_{hi}^2$ and one only takes derivatives 
of $\delta_{hi}^2$ with respect to the variances. Again we choose 
the $(2m_{hi} \times 2m_{hi})$ matrix $H_{hi}$ to be the “independence” 
working variance-covariance matrix and the $(2m_{hi} \times 1)$ 
column vector $\delta_{hi}^2 = [\sigma_1^2 + \mu_1^2, \sigma_2^2 + \mu_2^2, \ldots, \sigma_1^2 + \mu_1^2, \sigma_2^2 + \mu_2^2]'$ 
because we are assuming common variances and 
means across all strata and clusters. The above GEE can 
thus be solved non-iteratively: 

\[
\hat{\sigma}_p^2 = \frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} (Y_{hijp} - \hat{\mu}_p)^2}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}}
\]

for the $p$th measurement in the pair, $p = 1, 2$. 

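The CCC can be estimated in a third set of estimating 
equations by using the pairwise products of the responses to 
model $\sigma_{12}$, once the means and variances are estimated. Let 
$U_{hi} = [Y_{hi11} Y_{hi12}, Y_{hi21} Y_{hi22}, \ldots, Y_{him_{hi}1} Y_{him_{hi}2}]'$ be a $(m_{hi} \times 1)$ 
vector of pairwise products of the responses and denote 
$\theta_{hi} = E[U_{hi}]$, which is a function of the means, variances, 
and CCC. We solve for $\rho_c$ in a third set of estimating 
equations: 

\[
\sum_{h=1}^{H} \sum_{i=1}^{n_h} C_{hi}' W_{hi} K_{hi}^{-1} (U_{hi} - \theta_{hi}) = 0,
\]

where $C_{hi}'$ is a $(1 \times m_{hi})$ derivative vector $\partial\theta_{hi}'/\partial\rho_c$, 
$W_{hi}$ is a $(m_{hi} \times m_{hi})$ main diagonal matrix consisting of the 
person-specific sampling weights $w_{hij}$, and $K_{hi}$ is a 
$(m_{hi} \times m_{hi})$ working covariance matrix that we choose to 
be the “independence” covariance matrix. The above GEE 
can be solved non-iteratively: 

\[
\hat{\rho}_c = \frac{2\hat{\sigma}_{12}}{\hat{\sigma}_1^2 + \hat{\sigma}_2^2 + (\hat{\mu}_1 - \hat{\mu}_2)^2},
\qquad
\hat{\sigma}_{12} = \frac{\sum_{h} \sum_{i} \sum_{j} w_{hij} (Y_{hij1} - \hat{\mu}_1)(Y_{hij2} - \hat{\mu}_2)}{\sum_{h} \sum_{i} \sum_{j} w_{hij}}.
\]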
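
Because independence working matrices are used throughout, the three sets of estimating equations reduce to survey-weighted moments. The following is a minimal Python sketch of that computation, not the authors' code: the function name `weighted_ccc` and the toy data are hypothetical, and only the standard library is assumed.

```python
def weighted_ccc(y1, y2, w):
    """Survey-weighted CCC built from weighted means, variances and covariance.

    y1, y2 are the paired continuous responses and w the sampling weights
    (all argument names are illustrative, not taken from the paper).
    """
    sw = sum(w)
    mu1 = sum(wi * a for wi, a in zip(w, y1)) / sw
    mu2 = sum(wi * b for wi, b in zip(w, y2)) / sw
    var1 = sum(wi * (a - mu1) ** 2 for wi, a in zip(w, y1)) / sw
    var2 = sum(wi * (b - mu2) ** 2 for wi, b in zip(w, y2)) / sw
    cov12 = sum(wi * (a - mu1) * (b - mu2) for wi, a, b in zip(w, y1, y2)) / sw
    # Denominator of the CCC: both variances plus the squared mean difference.
    d = var1 + var2 + (mu1 - mu2) ** 2
    return 2.0 * cov12 / d


# Hypothetical toy data (not from NHANES III).
y1 = [1.0, 2.0, 3.0, 4.0]
y2 = [1.0, 2.0, 3.0, 5.0]
w = [1.0, 1.0, 2.0, 2.0]
rho_c = weighted_ccc(y1, y2, w)
```

For these toy values the weighted covariance is 55/36 and the denominator is 122/36, so the estimate works out to 55/61.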


2.2 Linearization estimator of variance 


The usual robust estimators of variance for the means 
and CCC from the GEE approach are invalid here because 
they do not take into account the sampling structure, only 
the correlation of observations made on the same individual. 
We propose standard error estimation using the Taylor 
series linearization method (Binder 1983; Binder 1996). The 
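first derivatives of $\rho_c$ (equation 1) with respect to $\mu_1$, 
$\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}$ are: 

\[
\frac{\partial\rho_c}{\partial\mu_1} = -\frac{4\sigma_{12}(\mu_1 - \mu_2)}{d^2}, \quad
\frac{\partial\rho_c}{\partial\mu_2} = \frac{4\sigma_{12}(\mu_1 - \mu_2)}{d^2}, \quad
\frac{\partial\rho_c}{\partial\sigma_1^2} = \frac{\partial\rho_c}{\partial\sigma_2^2} = -\frac{2\sigma_{12}}{d^2}, \quad
\frac{\partial\rho_c}{\partial\sigma_{12}} = \frac{2}{d},
\]

where $d = \sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2$. A first-order Taylor series 
expansion of $\hat{\rho}_c$ about the true parameter values gives 

\[
\hat{\rho}_c \approx \rho_c
+ \frac{\partial\rho_c}{\partial\mu_1}(\hat{\mu}_1 - \mu_1)
+ \frac{\partial\rho_c}{\partial\mu_2}(\hat{\mu}_2 - \mu_2)
+ \frac{\partial\rho_c}{\partial\sigma_1^2}(\hat{\sigma}_1^2 - \sigma_1^2)
+ \frac{\partial\rho_c}{\partial\sigma_2^2}(\hat{\sigma}_2^2 - \sigma_2^2)
+ \frac{\partial\rho_c}{\partial\sigma_{12}}(\hat{\sigma}_{12} - \sigma_{12}).
\]

The above equation can be rearranged into two parts, one 
involving the parameter estimates $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$, and $\hat{\sigma}_{12}$, 
and the other involving only parameters, which does not 
contribute to the variance estimation of $\hat{\rho}_c$. Thus the first 
part becomes 

\[
\frac{\partial\rho_c}{\partial\mu_1} \sum_{h,i,j} \tilde{w}_{hij} Y_{hij1}
+ \frac{\partial\rho_c}{\partial\mu_2} \sum_{h,i,j} \tilde{w}_{hij} Y_{hij2}
+ \frac{\partial\rho_c}{\partial\sigma_1^2} \sum_{h,i,j} \tilde{w}_{hij} (Y_{hij1} - \mu_1)^2
+ \frac{\partial\rho_c}{\partial\sigma_2^2} \sum_{h,i,j} \tilde{w}_{hij} (Y_{hij2} - \mu_2)^2
+ \frac{\partial\rho_c}{\partial\sigma_{12}} \sum_{h,i,j} \tilde{w}_{hij} (Y_{hij1} - \mu_1)(Y_{hij2} - \mu_2),
\tag{3}
\]

where $\tilde{w}_{hij} = w_{hij} / (\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij})$. Equation (3) 
becomes a linear function of the data after the summation is 
moved to the front, which we can then express as 
$\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \tilde{w}_{hij} Z_{hij}$, where 

\[
Z_{hij} = \frac{2}{d}(Y_{hij1} - \mu_1)(Y_{hij2} - \mu_2)
- \frac{2\sigma_{12}}{d^2}\left[(Y_{hij1} - \mu_1)^2 + (Y_{hij2} - \mu_2)^2 + 2(\mu_1 - \mu_2)(Y_{hij1} - Y_{hij2})\right].
\tag{4}
\]

One then creates a random variable $Z_{hij}$ based on equation 
(4) that replaces the parameters with their respective estimates. 
The variance of this new estimator $Z_{hij}$ is an approximation 
for the variance of $\hat{\rho}_c$, which can be estimated using 
standard survey software (see Appendix). 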
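
The linearization steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name and toy data are hypothetical, the linearized variable follows the form used in the appendix SAS code, and the design variance is approximated by between-cluster variation within strata, treating clusters as sampled with replacement (no finite population correction). Every stratum is assumed to contain at least two clusters.

```python
from collections import defaultdict
from math import sqrt


def ccc_linearization_se(y1, y2, w, strata, clusters):
    """Taylor-linearization standard error of the weighted CCC (sketch)."""
    sw = sum(w)
    mu1 = sum(wi * a for wi, a in zip(w, y1)) / sw
    mu2 = sum(wi * b for wi, b in zip(w, y2)) / sw
    c1 = [a - mu1 for a in y1]  # centered responses
    c2 = [b - mu2 for b in y2]
    var1 = sum(wi * a * a for wi, a in zip(w, c1)) / sw
    var2 = sum(wi * b * b for wi, b in zip(w, c2)) / sw
    cov = sum(wi * a * b for wi, a, b in zip(w, c1, c2)) / sw
    d = var1 + var2 + (mu1 - mu2) ** 2
    # Linearized variable Z with the estimates plugged in.
    z = [
        (2 / d) * a * b
        - (2 * cov / d ** 2) * (a * a + b * b + 2 * (mu1 - mu2) * (r1 - r2))
        for a, b, r1, r2 in zip(c1, c2, y1, y2)
    ]
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    # Cluster totals of the weighted, centered Z values within each stratum.
    totals = defaultdict(lambda: defaultdict(float))
    for wi, zi, h, c in zip(w, z, strata, clusters):
        totals[h][c] += wi * (zi - zbar) / sw
    # Assumed variance formula: with-replacement between-cluster variance.
    var = 0.0
    for psu in totals.values():
        n_h = len(psu)
        mean_h = sum(psu.values()) / n_h
        var += n_h / (n_h - 1) * sum((t - mean_h) ** 2 for t in psu.values())
    return sqrt(var)


# Hypothetical toy design: 2 strata, 2 clusters each, 2 members per cluster.
strata = [1, 1, 1, 1, 2, 2, 2, 2]
clusters = [1, 1, 2, 2, 1, 1, 2, 2]
y1 = [60.0, 72.0, 81.0, 95.0, 66.0, 78.0, 88.0, 101.0]
y2 = [62.0, 70.0, 85.0, 99.0, 65.0, 80.0, 90.0, 108.0]
w = [2.0, 3.0, 1.0, 4.0, 2.0, 2.0, 3.0, 1.0]
se = ccc_linearization_se(y1, y2, w, strata, clusters)
```

A useful sanity check on the sketch is that the linearized variable vanishes when the two measurements agree perfectly, so the estimated standard error is then zero.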


2.3. Jackknife estimator of variance 


We also use the jackknife technique for standard error 
estimation of the parameters following Rust and Rao (1996, 
Section 2.1) for comparison with the linearization estimates. 
The jackknife technique is implemented by calculating a set 
of replicate estimates and estimating the variance using 
them. A replicate data set is created for each cluster by 
deleting all observations from the given cluster from the 
sample. The weights of all other observations in the stratum 
containing the cluster are inflated by a factor $n_h/(n_h - 1)$. 
Weights in the other strata remain unchanged. Thus, the 
new weights for the replicated data set created by removing 
cluster $i$ from stratum $h$ are: 

\[
w_{klj}^{(hi)} =
\begin{cases}
w_{klj} & \text{if } k \neq h \text{ (different strata)} \\
w_{klj}\, n_h/(n_h - 1) & \text{if } k = h,\ l \neq i \text{ (same stratum but different clusters)} \\
0 & \text{if } k = h,\ l = i \text{ (for the cluster being removed).}
\end{cases}
\]

The resulting jackknife variance estimator for $\hat{\rho}_c$ is 

\[
v_J(\hat{\rho}_c) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} \left(\hat{\rho}_{c(hi)} - \hat{\rho}_c\right)^2,
\]

where $\hat{\rho}_{c(hi)}$ is estimated in the same way as $\hat{\rho}_c$, but using 
the recalculated weights $w^{(hi)}$ instead of the weights $w$. 
The jackknife estimators for the means are similarly 
calculated. 
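
The replicate-weight construction can be sketched in Python as below. This is a minimal illustration, assuming the standard delete-one-cluster stratified jackknife; the function name, the `estimate` callback and the toy data are hypothetical, and each stratum is assumed to contain at least two clusters.

```python
from collections import defaultdict


def jackknife_variance(estimate, w, strata, clusters):
    """Delete-one-cluster stratified jackknife variance (sketch).

    estimate: callable that maps a list of weights to a point estimate.
    """
    theta = estimate(w)
    psus = defaultdict(set)
    for h, c in zip(strata, clusters):
        psus[h].add(c)
    var = 0.0
    for h, ids in psus.items():
        n_h = len(ids)
        for c in ids:
            # Zero out the deleted cluster, inflate the rest of its stratum,
            # and leave all other strata unchanged.
            w_rep = [
                wi if hh != h else (wi * n_h / (n_h - 1) if cc != c else 0.0)
                for wi, hh, cc in zip(w, strata, clusters)
            ]
            var += (n_h - 1) / n_h * (estimate(w_rep) - theta) ** 2
    return var


# Hypothetical toy example: one stratum with two single-member clusters.
y = [1.0, 3.0]


def weighted_mean(wts):
    return sum(wi * yi for wi, yi in zip(wts, y)) / sum(wts)


v = jackknife_variance(weighted_mean, [1.0, 1.0], ["A", "A"], [1, 2])
```

In this toy case the two replicate means are 3 and 1 around a full-sample mean of 2, so the jackknife variance is 1, which matches the usual variance of a mean of two observations.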


2.4 The kappa coefficient 


Assume a pair of binary responses are observed for the 
$j$th member of the $i$th cluster of the $h$th stratum, $Y_{hij1}$ and 
$Y_{hij2}$, and their expected values are the probabilities $\pi_{hij1}$ 
and $\pi_{hij2}$. Again assume we are estimating common probabilities 
$\pi_1$ and $\pi_2$ without covariates for the pair of within-subject 
binary responses. Lipsitz et al. (1994) demonstrated 
how two sets of generalized estimating equations can be 
used to develop simple non-iterative estimates of the $\kappa$ 
coefficient that can be used for unbalanced data, as previous 
estimates of kappa and its variance were only proposed for 
balanced data. They defined the binary random variable 
$U_{hij} = Y_{hij1} Y_{hij2} + (1 - Y_{hij1})(1 - Y_{hij2})$, which equals 1 if both 
responses in the pair agree and 0 otherwise. Accordingly, 
$E[U_{hij}] = P_o$, which denotes the probability of observed 
agreement and is assumed here to be constant over all strata, 
clusters, and pairs of observations. Now let $E[Y_{hij1} Y_{hij2}] = 
\Pr[Y_{hij1} = Y_{hij2} = 1] = \omega$. The probability of observed 
agreement can be expressed as $P_o = 1 - \pi_1 - \pi_2 + 2\omega$. 
The probability of expected agreement by chance is defined 
as $P_e = \pi_1\pi_2 + (1 - \pi_1)(1 - \pi_2)$ and is estimated by 
$\hat{P}_e = \hat{\pi}_1\hat{\pi}_2 + (1 - \hat{\pi}_1)(1 - \hat{\pi}_2)$, where $\hat{\pi}_1$ and $\hat{\pi}_2$ are calculated 
in the first set of estimating equations. 

We can derive estimates of « from sample survey data 
following the approach for the CCC in Section 2.1. We can 
incorporate the survey weight matrices into Lipsitz et al.’s 
(1994) two sets of GEE equations for estimating kappa. 
Then, by choosing “independence” working covariance 
matrices for the two sets of equations as in Lipsitz et al.’s 
(1994) approach, we arrive at the following non-iterative 
estimate of kappa for sample survey data: 

\[
\hat{\kappa} = \frac{\hat{P}_o - \hat{P}_e}{1 - \hat{P}_e},
\qquad
\hat{P}_o = \frac{\sum_{h} \sum_{i} \sum_{j} w_{hij} U_{hij}}{\sum_{h} \sum_{i} \sum_{j} w_{hij}}.
\]

This estimator is identical to Lumley’s (2010), which can be 
computed using the svykappa function in the R survey 
package. 
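
As a rough illustration of the non-iterative kappa estimate, the following Python sketch computes the weighted marginal probabilities, the weighted observed agreement and the chance agreement. The function name and toy data are hypothetical, and the sketch makes no claim to reproduce the output of any particular software package.

```python
def weighted_kappa(y1, y2, w):
    """Survey-weighted kappa for a pair of binary (0/1) responses (sketch)."""
    sw = sum(w)
    pi1 = sum(wi * a for wi, a in zip(w, y1)) / sw
    pi2 = sum(wi * b for wi, b in zip(w, y2)) / sw
    # Agreement indicator: 1 when both responses are equal, 0 otherwise.
    u = [a * b + (1 - a) * (1 - b) for a, b in zip(y1, y2)]
    p_o = sum(wi * ui for wi, ui in zip(w, u)) / sw
    p_e = pi1 * pi2 + (1 - pi1) * (1 - pi2)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (not from NHANES III).
y1 = [1, 1, 0, 0]
y2 = [1, 0, 0, 0]
kappa_equal = weighted_kappa(y1, y2, [1.0, 1.0, 1.0, 1.0])
kappa_weighted = weighted_kappa(y1, y2, [2.0, 1.0, 1.0, 2.0])
```

With equal weights the observed agreement is 3/4 against a chance agreement of 1/2, giving kappa = 1/2; with the unequal weights the observed agreement rises to 5/6 and kappa to 2/3.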

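Standard error estimation of $\hat{\kappa}$ can be conducted 
similarly to that of $\hat{\rho}_c$ using the Taylor series linearization 
method. The first derivatives of kappa with respect to 
$\pi_1$, $\pi_2$, and $P_o$ are: 

\[
\frac{\partial\kappa}{\partial\pi_1} = -\frac{(1 - P_o)(2\pi_2 - 1)}{(1 - P_e)^2}, \quad
\frac{\partial\kappa}{\partial\pi_2} = -\frac{(1 - P_o)(2\pi_1 - 1)}{(1 - P_e)^2}, \quad
\frac{\partial\kappa}{\partial P_o} = \frac{1}{1 - P_e}.
\]

The corresponding linearized variable is 

\[
Z_{hij} = \frac{U_{hij}}{1 - P_e} - \frac{(1 - P_o)\left[(2\pi_2 - 1) Y_{hij1} + (2\pi_1 - 1) Y_{hij2}\right]}{(1 - P_e)^2}.
\tag{6}
\]

Replacing the parameters in (6) with their respective 
estimates, one then treats $Z_{hij}$ as a random variable and 
estimates its variance using standard survey software that 
accounts for the sampling design. The variance of this new 
estimator $Z_{hij}$ is an approximation for the variance of $\hat{\kappa}$. 
The jackknife method can also be used to estimate the 
variance of $\hat{\kappa}$. 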
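
A minimal Python sketch of the linearization for kappa follows. It is an illustration only: the linearized variable is obtained by differentiating $\kappa = (P_o - P_e)/(1 - P_e)$ with respect to the two marginal probabilities and the observed agreement, and the design variance is approximated by with-replacement between-cluster variation within strata, which is an assumption of this sketch. Names and toy data are hypothetical, and every stratum is assumed to contain at least two clusters.

```python
from collections import defaultdict
from math import sqrt


def kappa_linearization_se(y1, y2, w, strata, clusters):
    """Taylor-linearization standard error of the weighted kappa (sketch)."""
    sw = sum(w)
    pi1 = sum(wi * a for wi, a in zip(w, y1)) / sw
    pi2 = sum(wi * b for wi, b in zip(w, y2)) / sw
    u = [a * b + (1 - a) * (1 - b) for a, b in zip(y1, y2)]
    p_o = sum(wi * ui for wi, ui in zip(w, u)) / sw
    p_e = pi1 * pi2 + (1 - pi1) * (1 - pi2)
    # Linearized variable: derivative-weighted combination of U, Y1 and Y2.
    z = [
        ui / (1 - p_e)
        - (1 - p_o) * ((2 * pi2 - 1) * a + (2 * pi1 - 1) * b) / (1 - p_e) ** 2
        for ui, a, b in zip(u, y1, y2)
    ]
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    totals = defaultdict(lambda: defaultdict(float))
    for wi, zi, h, c in zip(w, z, strata, clusters):
        totals[h][c] += wi * (zi - zbar) / sw
    # Assumed variance formula: with-replacement between-cluster variance.
    var = 0.0
    for psu in totals.values():
        n_h = len(psu)
        mean_h = sum(psu.values()) / n_h
        var += n_h / (n_h - 1) * sum((t - mean_h) ** 2 for t in psu.values())
    return sqrt(var)


# Hypothetical toy design: 2 strata, 2 clusters each, 2 members per cluster.
strata = [1, 1, 1, 1, 2, 2, 2, 2]
clusters = [1, 1, 2, 2, 1, 1, 2, 2]
y1 = [1, 0, 1, 1, 0, 0, 1, 0]
y2 = [1, 0, 0, 1, 0, 1, 1, 0]
w = [2.0, 3.0, 1.0, 4.0, 2.0, 2.0, 3.0, 1.0]
se_kappa = kappa_linearization_se(y1, y2, w, strata, clusters)
```

When the two ratings agree for every subject, the linearized variable is constant and the sketch returns a standard error of zero.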


3. NHANES III survey 


We used data from the Third National Health and Nutri- 
tion Examination Survey to illustrate our method. NHANES 
III was conducted by the National Center for Health Statis- 
tics of the Centers for Disease Control and Prevention and 
was designed as a six-year survey divided into two phases 
(1988-1991 and 1991-1994). The data were collected using 
a complex, multistage, probability sampling design to 
select participants representative of the civilian, non- 
institutionalized US population. Details of the survey 
design and analytic and reporting guidelines were published 
in the NHANES III reference manuals and reports (National 
Center for Health Statistics 1996). 


3.1 The adolescent weight study 


Obesity is a rapidly increasing public health problem 
with surveillance most often based on self-reported values 
of height and weight. A series of recent studies and systematic 
reviews have attempted to assess the agreement between 
self-reported and measured weight, especially in the ado- 
lescent population. The general findings suggest that self- 
reported weight was slightly lower than measured weight, 
and that a significant number of normal weight adolescents 
misperceive themselves as overweight and are engaging in 
unhealthy weight control behaviors (Field, Aneja and 
Rosner 2007; Gorber, Tremblay, Moher and Gorber 2007; 
Sherry, Jefferds and Grummer-Strawn 2007). Therefore, 
researchers have suggested that obesity prevention programs 
should address weight misperceptions and the harmful 
effects of unhealthy weight control methods even among 
normal weight adolescents (Talamayan, Springer, Kelder, 
Gorospe and Joye 2006). A similar Canadian study from the 
2005 Canadian Community Health Survey that focused on 
adult individuals also showed that associations between 
obesity and health conditions may be overestimated if self- 
reported weight is used (Shield, Gorber and Tremblay 
2008). We use data obtained from the Body Measurements 
(Anthropometry) component of the NHANES III study to 
estimate the CCC that measures agreement between self- 
reported and measured weight (pounds) obtained from 
adolescents (aged 12 through 16 years). 

The self-reported weight was obtained just prior to the 
actual measurement of weight. We use data from the entire 
six-year survey period (both 1988-1991 and 1991-1994). 
For simplicity, we excluded one stratum which only had one 
PSU. Hence, there were 48 strata and each stratum had two 
PSUs. The sample weight labeled wtpfex6 accounting for 
the differential selection probability was used in our 
analyses. There were 1,651 subjects with complete data for 
both weight measurements. The estimates of the self- 
reported and actual weights (in pounds) were 135.5 (s.e.= 
1.8) and 136.3 (s.e.=1.8), respectively, calculated using 
PROC SURVEYMEANS in SAS. The estimates of the 
standard errors based on the jackknife approach are the 
same as above. 

The CCC is a natural choice for assessing the agreement 
between the two weight measurements because they are 
measured on the same scale and their ranges are similar 
(self-reported weight: 78 lbs to 350 lbs and actual weight: 
73 lbs to 372 lbs) (Lin and Chinchilli 1997). The estimate 
of the CCC for measuring the agreement between the two 
definitions of weight using the proposed method is 0.93. 
The standard error of the estimate is 0.021 using the Taylor 
series linearization method. The jackknife standard error of 
0.021 agrees closely with the linearization standard error. 
These statistics are summarized in Table 1 along with their 
values computed when the sampling structure is ignored. 
The standard errors for the estimates incorporating the 
sampling structure are much larger than those of the unweighted 
estimates. 

Table 1 

Unweighted and weighted average, CCC, and respective 
standard errors for adolescent self-reported and actual weight 
in pounds 


                        Self-reported   Actual    CCC 
Unweighted   Estimate   135.31          136.96    0.890 
             SE         0.76            0.80      0.0005 
Weighted     Estimate   135.47          136.30    0.926 
             SE         1.8             1.82      0.0205 


Similar to the CCC, the usual Pearson correlation 
coefficient between the self-reported and the actual weight 
measures is also 0.93. In this case, the mean difference 
between the two weight measurements is just less than one 
pound. When subpopulations are examined, differences are 
noted in the CCC and the Pearson correlation coefficient. 
Consider a subpopulation of those individuals that had a 
measured weight > 200 lbs at examination. Summarizing 
the data for this subpopulation, the self-reported weight is 
on average 8 pounds less than the measured weight 
(223.2 lbs vs 231.4 lbs). There is a slight departure of the 
CCC (0.72) from the Pearson correlation coefficient (0.76). 
The discrepancy between the two measures increases in the 
more obese subgroup. In the subpopulation where measured 
weight is > 220 lbs, the means of self-reported and 
measured weights are 231.9 lbs and 248.8 lbs, respectively. 
The CCC is 0.67, whereas the Pearson correlation coef- 
ficient is 0.85. In this situation, the CCC reflects both the 
reproducibility and differences between the self-reported 
and measured means. Therefore, the CCC is informative 
and advantageous when considering these comparisons, 
particularly in domain analysis within a complex survey. 
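
The divergence between the two coefficients arises because the CCC penalizes a shift between the means, whereas the Pearson correlation coefficient does not. The small unweighted Python sketch below illustrates this with hypothetical numbers (not NHANES III data); the function name is illustrative.

```python
from math import sqrt


def pearson_and_ccc(y1, y2):
    """Return (Pearson correlation, CCC) for unweighted paired data."""
    n = len(y1)
    mu1 = sum(y1) / n
    mu2 = sum(y2) / n
    var1 = sum((a - mu1) ** 2 for a in y1) / n
    var2 = sum((b - mu2) ** 2 for b in y2) / n
    cov = sum((a - mu1) * (b - mu2) for a, b in zip(y1, y2)) / n
    pearson = cov / sqrt(var1 * var2)
    # The squared mean difference enters only the CCC denominator.
    ccc = 2 * cov / (var1 + var2 + (mu1 - mu2) ** 2)
    return pearson, ccc


# Hypothetical data: the second measure is the first shifted by a constant.
reported = [1.0, 2.0, 3.0, 4.0, 5.0]
measured = [a + 2.0 for a in reported]
r, rho_c = pearson_and_ccc(reported, measured)
```

Here the Pearson coefficient equals 1 because the relationship is perfectly linear, while the CCC is only 0.5 because of the constant shift of 2 units.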


3.2 The oral health study 


Slade and Beck (1999) used extent of pocket depth and 
loss of attachment as indices of periodontal conditions. 
Prevalence of periodontal disease using previously reported 
thresholds of pocket depth ≥ 4 mm and attachment loss 
> 3 mm were estimated by Slade and Beck (1999, Table 
1). Pocket depth may be reflective of inflammation rather 
than chronic periodontal disease and, thus, attachment level 
may be a more meaningful measure of periodontal destruc- 
tion. However, pocket depth remains the recommended 
measurement in clinical practice (Winn, Johnson and 
Kingman 1999). Therefore, we compare the agreement of 
these two definitions of periodontal disease using the kappa 
coefficient. 

We use the sample that was analyzed by Slade and Beck 
(1999). The data include 14,415 persons aged 13 or older 
who had complete pocket depth and attachment loss 
assessment by six designated dentists. We again use data 
from the entire six-year survey period (both 1988-1991 and 
1991-1994). There were a total of 49 strata and each stratum 
had two PSUs. The sample weight variable, 
wtpfex6, accounting for differential selection probability, 
was used in our analyses. 

The first definition of periodontal disease is pocket depth 
≥ 4 mm and the second is maximum attachment loss > 
3 mm. For both variables we are using the maximum values 
among all teeth in an individual’s mouth. The probability 
estimates of the attachment loss and pocket depth variables 
are 0.358 (jackknife s.e.=0.0088) and 0.212 (jackknife 
s.e. = 0.016), respectively, using the proposed method. The 
asymptotic standard errors based on the usual Taylor 
series expansion (Woodruff 1971; produced by PROC 
SURVEYFREQ in SAS, version 9.1) are 0.0088 and 0.015, 
respectively. 

Kappa is a natural choice for assessing the agreement 
between two binary ratings as it corrects for chance agree- 
ment (Fleiss 1981). The estimate of kappa for measuring the 
agreement between the two definitions of periodontal 
disease (pocket depth of ≥ 4 mm and attachment loss of 
> 3 mm) using the proposed method is 0.307. The standard 
error of 0.0158 was obtained by both the Taylor series 
linearization and jackknife methods. Table 2 compares these 
results to the measures when the complex sampling struc- 
ture is ignored. The standard error of the kappa coefficient is 
larger when accounting for the survey structure. 


Table 2 
Unweighted and weighted average, kappa, and respective 
standard errors for attachment loss and pocket depth 


                        Attachment   Pocket 
                        Loss         Depth    Kappa 
Unweighted   Estimate   0.393        0.283    0.334 
             SE         0.004        0.004    0.008 
Weighted     Estimate   0.358        0.212    0.307 
             SE         0.009        0.016    0.0158 


4. Discussion 


The CCC and kappa evaluate the agreement between two 
measurements for continuous and categorical responses, 
respectively. In this paper, we have proposed a generalized 
estimating equation approach for estimating the CCC for a 
pair of continuous variables, and kappa for a pair of binary 
variables, from sample survey data where the data have 
been collected using complex survey features such as 
stratification or clustering. The usual sandwich estimator of 
the variance only accounts for repeated measurements made 
on the same individual, and does not account for the 
sampling framework (e.g., clustering, stratification, and 
weighting). In the GEE approach, standard error estimation 
of the estimators is conducted with the Taylor series 
linearization and jackknife approaches. If the data are not 
collected using complex survey features, the proposed 
estimators will be identical to the usual estimators. As is 
evident in the two examples from the NHANES III study, 
we have shown the need to incorporate sampling weights 
and the sampling design features so that the standard errors 
are not underestimated when data are collected from a 
complex sampling design. Tables 1 and 2 show that the 
weighted and unweighted standard errors differ substantially 
for both the CCC and kappa. Confidence intervals that incorporate 
weights and the design features will allow correct 
inference. 

In the appendix, we show steps for calculating the 
weighted measures of the CCC and kappa, along with their 
standard errors using standard survey software that incor- 
porates the sampling weights, clustering and stratification. 
The GEE approach is advantageous because it is a conve- 
nient framework for developing estimators of the agreement 
coefficients and is easily extended to multiple raters, 
multiple methods, covariate adjustment and unbalanced 
cluster sizes. This design-based approach results in correct 
standard error estimation without assuming an underlying 
model, while accounting for the sampling structure. If one is 
interested in estimating the agreement between two ordinal 
variables with kappa, then Williamson et al.’s (2000) generalized 
estimating equation approach can be extended 
similarly to the proposed method. 


Appendix 


Steps for calculating the CCC and its standard error using 
standard survey software 

Step 1: Calculate the means of the continuous variables 
        $Y_{hij1}$ and $Y_{hij2}$ using software for survey data that 
        incorporates stratification, clustering, and sample 
        weighting (e.g., PROC SURVEYMEANS in SAS). 

Step 2: Square the centered $Y_{hij1}$ and $Y_{hij2}$ values around 
        their respective means. 

Step 3: Calculate the means of the squared centered $Y_{hij1}$ 
        and $Y_{hij2}$ values using standard software for survey 
        data. These means are the variance estimates of 
        $Y_{hij1}$ and $Y_{hij2}$. Calculate the mean of the product 
        of the centered $Y_{hij1}$ and $Y_{hij2}$ values using standard 
        software for survey data. This mean is the estimated 
        covariance of $Y_{hij1}$ and $Y_{hij2}$. 

Step 4: Calculate the CCC by substituting the estimated 
        means, variances, and covariance into equation (1). 
        Create the new variable $Z_{hij}$ based on equation (4). 

Step 5: Calculate the standard error of $Z_{hij}$ using standard 
        software for survey data. The standard error of $Z_{hij}$ 
        estimates the standard error of $\hat{\rho}_c$. 


SAS CODE: 


Let y1 and y2 denote the variables for the pair of 
continuous responses, and s, c and w denote the variables 
for strata, cluster and weight: 


PROC SURVEYMEANS DATA=dataset MEAN; /* Step 1 above */ 
STRATA s; 
CLUSTER c; 
WEIGHT w; 
VAR y1 y2; 
ODS OUTPUT STATISTICS=stat; 
run; 
/* Store the weighted means as macro variables */ 
data _null_; 
set stat (where=(varname='y1')); 
call symputx('muy1', mean); 
data _null_; 
set stat (where=(varname='y2')); 
call symputx('muy2', mean); 
data dataset; set dataset; /* Step 2 above */ 
cy1 = y1 - &muy1; 
cy2 = y2 - &muy2; 
vary1 = cy1**2; 
vary2 = cy2**2; 
covy12 = cy1*cy2; 
PROC SURVEYMEANS DATA=dataset MEAN; /* Step 3 above */ 
STRATA s; 
CLUSTER c; 
WEIGHT w; 
VAR vary1 vary2 covy12; 
ODS OUTPUT STATISTICS=stat; 
run; 
/* Store the weighted variances and covariance as macro variables */ 
data _null_; 
set stat (where=(varname='vary1')); 
call symputx('vary1', mean); 
data _null_; 
set stat (where=(varname='vary2')); 
call symputx('vary2', mean); 
data _null_; 
set stat (where=(varname='covy12')); 
call symputx('covy12', mean); 
data dataset; set dataset; /* Step 4 above */ 
d = &vary1 + &vary2 + (&muy1 - &muy2)**2; 
CCC = 2*&covy12/d; 
/* Z is the linearized variable of equation (4) */ 
Z = (2/d)*(cy1*cy2) - (2*&covy12/d/d)*((cy1**2) + 
(cy2**2) + 2*(&muy1 - &muy2)*(y1 - y2)); 
PROC SURVEYMEANS DATA=dataset MEAN; /* Step 5 above */ 
STRATA s; 
CLUSTER c; 
WEIGHT w; 
VAR CCC Z; 
run; 


Steps for calculating kappa and its standard error using 
standard survey software 


Step 1: Estimate the probabilities of the binary variables 
        $Y_{hij1}$ and $Y_{hij2}$ using software for survey data that 
        incorporates stratification, clustering, and sample 
        weighting (e.g., PROC SURVEYFREQ in SAS). 

Step 2: Estimate $P_e$ ($= \hat{\pi}_1\hat{\pi}_2 + (1 - \hat{\pi}_1)(1 - \hat{\pi}_2)$). 

Step 3: Create the new agreement variable $U_{hij}$ 
        ($= Y_{hij1} Y_{hij2} + (1 - Y_{hij1})(1 - Y_{hij2})$). 

Step 4: Calculate the sum of the sample survey weights 
        and the sum of the weighted $U_{hij}$ (e.g., using 
        PROC SURVEYMEANS in SAS). Estimate 
        kappa using equation (2). 

Step 5: Create a new variable $Z_{hij}$ using equation (6). 

Step 6: Calculate the standard error of $Z_{hij}$ using standard 
        software for survey data. The standard error 
        of $Z_{hij}$ estimates the standard error of $\hat{\kappa}$. 


Combining synthetic data with subsampling 
to create public use microdata files for large scale surveys 


Jörg Drechsler and Jerome P. Reiter ¹ 


Abstract 


To create public use files from large scale surveys, statistical agencies sometimes release random subsamples of the original 
records. Random subsampling reduces file sizes for secondary data analysts and reduces risks of unintended disclosures of 
survey participants’ confidential information. However, subsampling does not eliminate risks, so that alteration of the data is 
needed before dissemination. We propose to create disclosure-protected subsamples from large scale surveys based on 
multiple imputation. The idea is to replace identifying or sensitive values in the original sample with draws from statistical 
models, and release subsamples of the disclosure-protected data. We present methods for making inferences with the 
multiple synthetic subsamples. 


Key Words: Confidentiality; Disclosure; Multiple imputation. 


1. Introduction 


National Statistical Institutes (NSIs) like the U.S. Census 
Bureau and Statistics Canada conduct large scale surveys 
that are highly valued by secondary data analysts, such as 
the American Community Survey (ACS) and the National 
Longitudinal Survey of Children and Youth (NLSCY). 
While these analysts desire access to as much data as 
possible, the NSI also must protect the confidentiality of 
survey participants’ identities and sensitive attributes. A 
common strategy for reducing disclosure risks in large scale 
studies is to release subsamples of the original survey data; 
for example, the Census Bureau releases a subsample from 
the collected ACS data comprising 1% of all U.S. house- 
holds (the collected ACS data comprise 2.5% of all house- 
holds), and Statistics Canada releases a 20% sample of indi- 
viduals from the NLSCY. See Willenborg and de Waal 
(2001) and Reiter (2005) for discussions of the confiden- 
tiality protection engendered by sampling. Typically, how- 
ever, subsampling alone does not eliminate disclosure 
risks, particularly for units in the subsample with unusual 
combinations of characteristics. NSIs therefore alter data 
before dissemination. For example, in the ACS, the Cen- 
sus Bureau performs data swapping, topcoding of selected 
variables, aggregating of geography, and age perturbation; 
in the NLSCY, Statistics Canada uses data swapping and 
suppression. 

When implemented with high intensity, as may be nec- 
essary to protect confidentiality in highly visible surveys, 
standard disclosure limitation strategies can seriously distort 
inferences (Winkler 2007; Elliott and Purdam 2007; 
Drechsler and Reiter 2010). Further, for many standard 
techniques it is difficult for data analysts - especially those 
without advanced statistical training - to properly account 
for the effects of the disclosure control in estimation. Moti- 
vated by these limitations, we propose a new approach for 
generating public use microdata samples from large scale 
surveys called subsampling with synthesis. The basic idea is 
to replace identifying or sensitive values in the original 
sample with multiple draws from statistical models esti- 
mated with the original data file, and release subsamples of 
the disclosure-protected data. The subsamples can com- 
prise one common set of records, or they can be taken 
independently. 

This approach is a variant of partially synthetic data 
(Little 1993; Reiter 2003), which has been used in the U.S. 
to create several public use data products, including the 
Survey of Income and Program Participation, the Longitudi- 
nal Business Database, the Survey of Consumer Finances, 
the American Community Survey group quarters data, and 
OnTheMap. The approach proposed here differs from par- 
tial synthesis because of the subsampling, which neces- 
sitates adjustments to the inferential methods of Reiter 
(2003); these are presented here. The approach also differs 
from the methods for creating synthetic public use micro- 
data samples of census data developed recently by Drechsler 
and Reiter (2010). In subsampling with synthesis, the initial 
data come from a survey and not from a census; thus, infer- 
ences must account for the additional uncertainty that results 
from the initial sampling. 


2. General approach 


We now describe the data generation and inferential
procedures for the two approaches to subsampling with
synthesis: releasing different (independent) subsamples, and
releasing a common set of records in each subsample. The
data generation methods, as well as methods for making
valid inferences from the multiple datasets, depend on the
subsampling approach. For both approaches, we let $D$
denote the original survey data of $n_1$ units sampled from a
population consisting of $N$ units. We initially assume that
the original sampling design is a simple random sample; we 
later extend to stratified sampling. We assume that all 
sampled units fully respond in D. Unlike for standard 
partial synthesis (Reiter 2004), methods have not been 
developed to handle missing data and synthesis with sub- 
sampling simultaneously. We focus here on general descrip- 
tions of the approaches and presentation of the inferential 
methods. We do not discuss synthesis model building stra- 
tegies; see Drechsler and Reiter (2009) and the references 
therein for guidance. 


2.1 Releasing different random subsamples

2.1.1 Summary of approach

To begin, the NSI creates $m$ partially synthetic datasets,
$D_{syn} = \{D_i : i = 1, \ldots, m\}$, for the original survey following
the approach of Reiter (2003). Specifically, the NSI replaces
identifying or sensitive values in $D$ with multiple imputations.
Synthesis models are estimated using only the
records whose values will be synthesized. The synthesis is
done independently $m$ times, resulting in $D_{syn}$. The NSI
then takes a simple random subsample of $n_2 < n_1$ records
from each $D_i$. These subsamples, $d_{syn} = \{d_i : i = 1, \ldots, m\}$,
are released to the public.

The analyst of $d_{syn}$ seeks inferences about some estimand
$Q$, such as a population mean or regression coefficient.
In each $d_i$, the analyst estimates $Q$ with some point
estimator $q$ and estimates the variance of $q$ with some
estimator $u$, where the analyst specifies $q$ and $u$ acting as
if $d_i$ were the collected data. Here, $u$ is specified ignoring
any finite population correction factors; for example, when
$q$ is the sample mean, $u = s^2/n_2$, with $s^2$ being the
sample variance. For $i = 1, \ldots, m$, let $q_i$ and $u_i$ be the
values of $q$ and $u$ in $d_i$. The following quantities are
needed for inferences.

$$\bar{q}_m = \sum_{i=1}^{m} q_i / m \qquad (1)$$

$$b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1) \qquad (2)$$

$$\bar{u}_m = \sum_{i=1}^{m} u_i / m. \qquad (3)$$

The analyst then can use $\bar{q}_m$ to estimate $Q$ and

$$T_d = (n_2/n_1 - n_2/N)\,\bar{u}_m + b_m/m \qquad (4)$$

to estimate the variance of $\bar{q}_m$. Derivations of these
estimates are presented in Section 2.1.2. We note that
without subsampling, i.e., $n_2 = n_1$, (4) reduces to
$(1 - n_1/N)\,\bar{u}_m + b_m/m$, which equals the variance
estimate for standard partial synthesis (Reiter 2003). For
large $n_2$, inferences are based on a $t$-distribution,
$(\bar{q}_m - Q) \sim t_{v_d}(0, T_d)$, with degrees of freedom
$v_d = (m - 1)\{1 + (n_2/n_1 - n_2/N)\, m\, \bar{u}_m / b_m\}^2$.
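As a rough illustration of how an analyst might apply (1)-(4), the following Python sketch combines the estimates from the released subsamples. The function and argument names are hypothetical, and the input values are made up; the expression used for $T_d$ is the one implied by the derivation in Section 2.1.2.

```python
from statistics import mean, variance


def combine_different_subsamples(q, u, n1, n2, N):
    """Combine estimates from m independently drawn synthetic subsamples.

    q[i], u[i]: point estimate and its variance estimate (computed without
    a finite population correction) from released dataset i.
    n1, n2, N: original sample size, subsample size, population size.
    """
    m = len(q)
    q_bar = mean(q)        # equation (1)
    b_m = variance(q)      # equation (2): divisor m - 1
    u_bar = mean(u)        # equation (3)
    c = n2 / n1 - n2 / N   # multiplier of u_bar in T_d
    t_d = c * u_bar + b_m / m                      # equation (4)
    df = (m - 1) * (1 + c * m * u_bar / b_m) ** 2  # degrees of freedom v_d
    return q_bar, t_d, df


# Illustrative (made-up) estimates from m = 5 released subsamples.
q_bar, t_d, df = combine_different_subsamples(
    q=[10.0, 10.2, 9.9, 10.1, 9.8], u=[0.04] * 5,
    n1=7500, n2=5000, N=1_000_000,
)
```

The sketch assumes $b_m > 0$; with identical point estimates the degrees of freedom are undefined.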

The inferential methods can be extended to stratified
samples in which the NSI uses the same strata for the
subsample and original sample. Let $N_h$ be the population
size in stratum $h$, where $h = 1, \ldots, H$. For each $h$, let
$\bar{q}_{mh}$ and $T_{dh}$ be the values of (1) and (4) computed using
only the records in $d_{syn}$ in stratum $h$. These estimates are
used in inferences for population quantities in stratum $h$.
For inferences about the entire population mean, the point
estimate of $Q$ is $\bar{q}_m = \sum_h (N_h/N)\, \bar{q}_{mh}$, and its estimated
variance is $T_d = \sum_h (N_h/N)^2\, T_{dh}$. Point and variance estimates
for nonlinear functions of means can be derived using
Taylor series expansions. We note that NSIs should release
the values of $n_{2h}/n_{1h}$ for all strata to enable variance
estimation.
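The stratified combination can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the stratum-level estimates are made up, and the stratum sizes are borrowed from the simulation design of Section 3. The variance uses squared stratum weights, as for any weighted sum of independent stratum estimates.

```python
def combine_strata(q_h, t_h, pop_sizes):
    """Population-level point estimate and variance from stratum results.

    q_h[h], t_h[h]: stratum-level point estimate and variance estimate.
    pop_sizes[h]: population size N_h of stratum h.
    """
    total = sum(pop_sizes)
    weights = [n_h / total for n_h in pop_sizes]
    q = sum(w * qh for w, qh in zip(weights, q_h))
    t = sum(w ** 2 * th for w, th in zip(weights, t_h))
    return q, t


q_pop, t_pop = combine_strata(
    q_h=[1.0, 2.0, 3.0, 4.0],
    t_h=[0.01, 0.01, 0.01, 0.01],
    pop_sizes=[750_000, 200_000, 40_000, 10_000],
)
```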


2.1.2 Derivation of inferences for the different 
random subsamples approach 


The analyst seeks $f(Q \mid d_{syn})$, which can be written as

$$f(Q \mid d_{syn}) = \int f(Q \mid D_{syn}, d_{syn})\, f(D_{syn} \mid d_{syn})\, dD_{syn}. \qquad (5)$$

For all derivations in Section 2.1.2, we assume that the
analyst’s distributions are identical to those used by the NSI
for creating $D_{syn}$. We also assume that the sample sizes are
large enough to permit normal approximations for these
distributions. Thus, we require only the first two moments
for each distribution, which we derive using standard large
sample Bayesian arguments. Diffuse priors are assumed for
all parameters.

Let $Q_i$ and $U_i$ be the point estimate of $Q$ and its variance
that the analyst would compute with $D_i$ (which is not
available to the analyst). Let $\bar{Q}_m$, $\bar{U}_m$, and $B_m$ be defined as
in (1)-(3) but using $Q_i$ and $U_i$. From standard partial
synthesis results (Reiter 2003), we have
$(Q \mid D_{syn}) \sim N(\bar{Q}_m, \bar{U}_m + B_m/m)$. We assume that
$(q_i \mid D_i) \sim N(Q_i, (1 - n_2/n_1)\, u_i)$ and, as is typical in multiple
imputation contexts, that $u_i \approx \bar{u}_m$. Thus, using standard
Bayesian theory, we have
$(\bar{Q}_m \mid d_{syn}) \sim N(\bar{q}_m, (1 - n_2/n_1)\, \bar{u}_m/m)$ and
$((m - 1)\, b_m / \{B_m + (1 - n_2/n_1)\, \bar{u}_m\} \mid d_{syn}) \sim \chi^2_{m-1}$.
Hence, we have
$f(Q \mid d_{syn}, B_m, \bar{U}_m) = N(\bar{q}_m, \bar{U}_m + B_m/m + (1 - n_2/n_1)\, \bar{u}_m/m)$.

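To get $f(Q \mid d_{syn})$, we need to integrate out $B_m$ and $\bar{U}_m$
from this distribution. We do so by substituting $B_m$ and $\bar{U}_m$
with their approximate expected values. To approximate
$E(B_m \mid d_{syn})$, we use $b_m - (1 - n_2/n_1)\, \bar{u}_m$. To approximate
$E(\bar{U}_m \mid d_{syn})$, we note that

$$\mathrm{var}(Q \mid d_i) = E[\mathrm{var}(Q \mid D_i) \mid d_i] + \mathrm{var}[E(Q \mid D_i) \mid d_i] = E(U_i \mid d_i) + \mathrm{var}(Q_i \mid d_i). \qquad (6)$$

Here, $\mathrm{var}(Q \mid d_i) = (1 - n_2/N)\, u_i$ and, from the normal
approximation above, $\mathrm{var}(Q_i \mid d_i) = (1 - n_2/n_1)\, u_i$. Solving (6), we have
$E(\bar{U}_m \mid d_{syn}) \approx (n_2/n_1 - n_2/N)\, \bar{u}_m$. After substitution of
these expected values, we have $\mathrm{var}(Q \mid d_{syn}) \approx T_d$.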
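To make the substitution step explicit, the following worked equation restates it under the approximations already given in this section; it is a sketch of the algebra, not an additional result. The two approximate expectations are plugged into the conditional variance of $Q$ given $B_m$ and $\bar{U}_m$:

```latex
\begin{align*}
\mathrm{var}(Q \mid d_{syn})
  &\approx E(\bar{U}_m \mid d_{syn}) + \frac{E(B_m \mid d_{syn})}{m}
     + \Bigl(1 - \frac{n_2}{n_1}\Bigr)\frac{\bar{u}_m}{m} \\
  &\approx \Bigl(\frac{n_2}{n_1} - \frac{n_2}{N}\Bigr)\bar{u}_m
     + \frac{b_m - (1 - n_2/n_1)\,\bar{u}_m}{m}
     + \Bigl(1 - \frac{n_2}{n_1}\Bigr)\frac{\bar{u}_m}{m} \\
  &= \Bigl(\frac{n_2}{n_1} - \frac{n_2}{N}\Bigr)\bar{u}_m + \frac{b_m}{m} = T_d .
\end{align*}
```

The two terms in $(1 - n_2/n_1)\,\bar{u}_m/m$ cancel, which is why only the between-dataset variance $b_m/m$ and a rescaled $\bar{u}_m$ remain.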

Since we use an estimated variance for $Q$, we approximate
$f(Q \mid d_{syn})$ with a $t$-distribution with mean $\bar{q}_m$ and
variance $T_d$. The degrees of freedom, $v_d$, is derived by
matching the first two moments of
$(v_d T_d)/\{(n_2/n_1 - n_2/N)\, \bar{u}_m + B_m/m + (1 - n_2/n_1)\, \bar{u}_m/m\}$
to those of a $\chi^2_{v_d}$ distribution.


2.2 Releasing the same random subsample 


At first glance, releasing a common set of records in each 
subsample looks like standard partial synthesis. However, 
Reiter’s (2003) variance estimator can be positively biased 
in this context. To illustrate, suppose that $D$ comprises one
variable with sample mean $\bar{x}_1$. Also suppose that we create
$D_{syn}$ by replacing all values of $x$, and we randomly select a
common set of $n_2$ records for the subsample. Let $m = \infty$,
and let $Q$ be the population mean of $x$. If replacements are
simulated from the correct model, which is estimated with
$D$, then $\bar{q}_\infty = \bar{x}_1$. Hence, $\mathrm{var}(\bar{q}_\infty)$ is identical to the
variance of $\bar{x}_1$, which is $(1 - n_1/N)\, s_1^2/n_1$. However,
Reiter’s (2003) variance estimate includes $\bar{u}_\infty$ based on
$(1 - n_2/N)\, s_2^2/n_2$, where $E(s_2^2) = s_1^2$. Hence, in general
Reiter’s (2003) variance will have positive bias for subsamples
with synthesis.
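A small numerical sketch may help to show the size of this bias. The sizes below are illustrative assumptions (they happen to match the smallest stratum in the simulation of Section 3), the sample variance is set to one for convenience, and the helper name is hypothetical.

```python
def srs_mean_variance(s2, n, N):
    """Variance of a sample mean under simple random sampling, with fpc."""
    return (1 - n / N) * s2 / n


N, n1, n2 = 10_000, 7_500, 5_000  # illustrative population and sample sizes
s2 = 1.0                          # common value of s_1^2 and E(s_2^2)

true_var = srs_mean_variance(s2, n1, N)  # variance of the original sample mean
u_term = srs_mean_variance(s2, n2, N)    # quantity the u-bar term is based on
ratio = u_term / true_var                # how far the u-bar term alone overshoots
```

With these sizes the term based on the subsample is three times the variance of the original sample mean, before any between-dataset component is added.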

In place of standard partial synthesis, we adopt the ap- 
proach taken by Reiter (2008) for multiple imputation for 
missing data when records used for imputation are not used 
or disseminated for analysis. This setting is akin to subsampling
the same records in each $d_i$ because the models
for the synthesis are estimated with $D$, but the analyst only
has $d_{syn}$ for analysis; that is, not all records used for imputation
are disseminated for analysis.

For convenience, we summarize the methodology of 
Reiter (2008) here but do not include the derivations. First, 
as in standard partial synthesis, the NSI estimates the syn- 
thesis models using only the records whose values will be 
synthesized. Let $\theta$ be the parameters that govern the distribution
of the synthetic data models. Second, the NSI samples
$m$ values of $\theta$ from its posterior distribution. Third,
for each drawn $\theta^{(l)}$, where $l = 1, \ldots, m$, the NSI draws a
replacement dataset $D^{(l,p)}$ from the synthesis models based
on $\theta^{(l)}$. The NSI repeats this process $r$ times for each
$\theta^{(l)}$. Finally, the NSI releases the collection of $M = mr$
subsamples from these datasets,
$d_{syn} = \{d^{(l,p)} : l = 1, \ldots, m;\ p = 1, \ldots, r\}$. Each $d^{(l,p)}$
includes an index of its nest $l$.
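The two-stage generation can be sketched for a deliberately simple case. The following Python toy assumes a single variable modelled as normal with a diffuse prior, synthesizes every value, and uses hypothetical names throughout; real synthesis models would be regressions fitted to the confidential data, so this shows only the nesting structure.

```python
import random
from statistics import mean, variance


def nested_synthesis(x, n2, m, r, seed=0):
    """Toy two-stage synthesis for one normally modelled variable.

    Returns m * r released datasets that share one common set of n2 records.
    """
    rng = random.Random(seed)
    n1 = len(x)
    keep = sorted(rng.sample(range(n1), n2))  # common subsample of record ids
    x_bar, s2 = mean(x), variance(x)
    released = []
    for l in range(m):
        # Draw theta^(l) = (mu, sigma2) from its posterior under a diffuse prior:
        # sigma2 ~ (n1 - 1) s2 / chi-square(n1 - 1), mu | sigma2 ~ N(x_bar, sigma2 / n1).
        sigma2 = (n1 - 1) * s2 / rng.gammavariate((n1 - 1) / 2, 2)
        mu = rng.gauss(x_bar, (sigma2 / n1) ** 0.5)
        for p in range(r):
            # Draw replacement values for the common records given theta^(l).
            values = [rng.gauss(mu, sigma2 ** 0.5) for _ in keep]
            released.append({"nest": l, "replicate": p,
                             "records": keep, "values": values})
    return released
```

Each released dataset carries its nest index, which the analyst needs in order to separate within-nest from between-nest variability.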

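For $l = 1, \ldots, m$ and $p = 1, \ldots, r$, let $q^{(l,p)}$ and $u^{(l,p)}$
be the estimate of $Q$ and its estimated variance computed
with $d^{(l,p)}$. Here, $u^{(l,p)}$ includes the finite population correction
factor. The following quantities are used for
inferences:

$$\bar{q}_M = \sum_{l=1}^{m} \sum_{p=1}^{r} q^{(l,p)} / (mr) = \sum_{l=1}^{m} \bar{q}^{(l)} / m, \qquad (7)$$

$$\bar{w}_M = \sum_{l=1}^{m} \sum_{p=1}^{r} (q^{(l,p)} - \bar{q}^{(l)})^2 / \{m(r - 1)\} = \sum_{l=1}^{m} w^{(l)} / m, \qquad (8)$$

$$b_M = \sum_{l=1}^{m} (\bar{q}^{(l)} - \bar{q}_M)^2 / (m - 1), \qquad (9)$$

$$\bar{u}_M = \sum_{l=1}^{m} \sum_{p=1}^{r} u^{(l,p)} / (mr). \qquad (10)$$

Here $\bar{q}^{(l)} = \sum_{p} q^{(l,p)}/r$ and $w^{(l)} = \sum_{p} (q^{(l,p)} - \bar{q}^{(l)})^2/(r - 1)$
are the mean and variance of the point estimates within nest $l$.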


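The analyst can use $\bar{q}_M$ to estimate $Q$ and
$T_s = \bar{u}_M - \bar{w}_M + (1 + 1/m)\, b_M - \bar{w}_M/r$ to estimate the variance of $\bar{q}_M$.
When $n_2$ is large, inferences are based on a $t$-distribution,
$(\bar{q}_M - Q) \sim t_{v_s}(0, T_s)$, with degrees of freedom

$$v_s = \left[ \frac{\{(1 + 1/m)\, b_M\}^2}{(m - 1)\, T_s^2} + \frac{\{(1 + 1/r)\, \bar{w}_M\}^2}{\{m(r - 1)\}\, T_s^2} \right]^{-1}. \qquad (11)$$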

It is possible that $T_s < 0$, particularly for small $m$ and
$r$. In that case, analysts can use the always positive but conservative
variance estimator, $T_s^* = \lambda T_s + (1 - \lambda)(1 + 1/m)\, b_M$,
where $\lambda = 1$ when $T_s > 0$ and $\lambda = 0$ otherwise. Motivation
for this estimator is provided in Reiter (2008). Generally,
negative values of $T_s$ can be avoided by making $m$
and $r$ large. When $T_s < 0$, inferences are based on a $t$-distribution
with $(m - 1)$ degrees of freedom, which comes
from using only the first term and $T_s^*$ in (11).
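The combining rules for the same subsample approach can be sketched as follows. This is a minimal Python illustration with hypothetical names; the inputs are nested lists indexed by nest and replicate, and the fallback to the conservative estimator with $m - 1$ degrees of freedom follows the description above.

```python
from statistics import mean, variance


def combine_same_subsample(q, u):
    """Combine nested estimates q[l][p], u[l][p] (m nests, r replicates each)."""
    m, r = len(q), len(q[0])
    nest_means = [mean(row) for row in q]
    q_bar = mean(nest_means)                   # equation (7)
    w_bar = mean([variance(row) for row in q])  # equation (8)
    b_M = variance(nest_means)                 # equation (9)
    u_bar = mean([mean(row) for row in u])     # equation (10)
    t_s = u_bar - w_bar + (1 + 1 / m) * b_M - w_bar / r
    if t_s > 0:
        df = 1 / (((1 + 1 / m) * b_M) ** 2 / ((m - 1) * t_s ** 2)
                  + ((1 + 1 / r) * w_bar) ** 2 / (m * (r - 1) * t_s ** 2))
        return q_bar, t_s, df
    # Negative estimate: fall back to the conservative (1 + 1/m) b_M.
    return q_bar, (1 + 1 / m) * b_M, m - 1


# Made-up inputs with m = 2 nests and r = 2 replicates.
est, var_est, df = combine_same_subsample(
    q=[[1.0, 1.2], [2.0, 2.2]], u=[[0.5, 0.5], [0.5, 0.5]]
)
```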

For stratified designs, the point estimate for whole
population quantities is $\bar{q}_M = \sum_h (N_h/N)\, \bar{q}_{Mh}$, and its estimated
variance is $T_s = \sum_h (N_h/N)^2\, T_{sh}$, where $\bar{q}_{Mh}$ and
$T_{sh}$ are the point estimate and its variance in stratum $h$. The
degrees of freedom in the $t$-distribution for stratified
sampling is


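$$v_s = \left[ \frac{\{\sum_h (N_h/N)^2 (1 + 1/m)\, b_{Mh}\}^2}{(m - 1)\, \{\sum_h (N_h/N)^2\, T_{sh}\}^2} + \frac{\{\sum_h (N_h/N)^2 (1 + 1/r)\, \bar{w}_{Mh}\}^2}{\{m(r - 1)\}\, \{\sum_h (N_h/N)^2\, T_{sh}\}^2} \right]^{-1}.$$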


This is derived by moment matching to a $\chi^2$ random
variable.


3. Illustrative simulations using a stratified
sampling design


In this section, we investigate the analytical properties of
the inferential procedures for subsampling with synthesis for
stratified simple random sampling. We generate a population
of $N$ = 1,000,000 records comprising five variables,
$Y_1, \ldots, Y_5$, in $H = 4$ strata. $Y_1$ is a categorical variable with
ten categories generated according to the distribution in
Table 1. The distributions for $(Y_2, \ldots, Y_5)$ are displayed in
Table 2, along with the stratum sizes.


Table 1
Empirical distribution of $Y_1$ in the generated population


category        1      2      3      4     5     6     7     8     9    10
percentage  24.77  32.63  16.38  15.06  7.13  2.53  0.95  0.33  0.15  0.09


To create $D$, we randomly sample $n_{1h}$ = 7,500 records
from each stratum. Each subsample comprises $n_{2h}$ = 5,000
records for each stratum. In practice, the NSI might use
proportional allocation to set each $n_{1h}$ and choose smaller
sampling rates to set $n_{2h}$. We use a common sample size
and large sampling fractions to illustrate that the variance
formulas for subsampling with synthesis correctly handle
non-trivial finite population correction factors, e.g., each
subsample contains 50% of the population records in stratum 4.
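The sampling fractions implied by this design can be tabulated with a short Python sketch. The stratum sizes are those listed in Table 2; the dictionary keys are hypothetical labels, and the last entry is the multiplier of the average within-dataset variance in the variance estimate for the different subsamples approach.

```python
stratum_sizes = {1: 750_000, 2: 200_000, 3: 40_000, 4: 10_000}  # N_h from Table 2
n1h, n2h = 7_500, 5_000  # original sample and subsample size per stratum

design = {
    h: {
        "original_rate": n1h / N_h,                # n_1h / N_h
        "subsample_rate": n2h / N_h,               # n_2h / N_h
        "u_multiplier": n2h / n1h - n2h / N_h,     # n_2h/n_1h - n_2h/N_h
    }
    for h, N_h in stratum_sizes.items()
}
```

In stratum 4 the multiplier is only about 0.17, whereas in stratum 1 it is close to 0.66, which shows how strongly the finite population corrections differ across strata in this design.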

We consider $Y_4$ and $Y_5$ to be the confidential variables
and illustrate two synthesis scenarios. In the first, we
synthesize all records’ values of $Y_4$ and $Y_5$. To do so, in
each stratum we simulate $Y_{4h}$ using a regression of $Y_{4h}$ on
$(Y_{1h}, Y_{2h}, Y_{3h})$ estimated with $D$, and we simulate $Y_{5h}$
using a regression of $Y_{5h}$ on $(Y_{1h}, Y_{2h}, Y_{3h}, Y_{4h})$ estimated
with $D$. Predictions of $Y_{5h}$ are based on the synthesized
values of $Y_{4h}$. In the second approach, in each stratum we
replace $Y_{4h}$ and $Y_{5h}$ only for all records with $Y_{4h} > p_h$,
where $p_h$ is the 90th percentile of $Y_4$ in the population
in stratum $h$. We generate replacement values by sampling
from regression models; however, the models in each
stratum are estimated only with those records satisfying
$Y_{4h} > p_h$.

For the different subsamples approach, we generate
$m = 5$ synthetic surveys as outlined in Section 2.1. For the
same subsample approach, we first draw $m = 5$ values of
$\theta$, the regression coefficients and variances. For each $\theta^{(l)}$,
we generate $r = 5$ synthetic datasets for every first stage
nest.

For all scenarios, we repeat the process of (i) creating $D$
by sampling from the population and (ii) generating subsamples
with synthesis a total of 5,000 times. For each of
these 5,000 runs, we obtain inferences for fifty quantities,
including the population means and within-stratum means
of $Y_4$ and $Y_5$, the coefficients from a regression of $Y_5$ on all
other variables, and the coefficients from a regression of $Y_4$
on all other variables. The regressions are estimated separately
in each stratum.


Table 2 
Parameters for drawing $(Y_2, \ldots, Y_5)$ for the population
Stratum size Model Distribution of the error term 
Stratum 1 750,000 $Y_2 = Y_1 + e$ $e \sim N(0, 5)$
¥, =Y+¥, +e 
Yael eee oer a te 
Ye, V5 ea Yee 
Stratum 2 200,000 $Y_2 = 2Y_1 + e$ $e \sim N(0, 10)$
Yz'= 2Y,'+.0.5Y5+ e 
Yo +099 + Yo de 
Yo 21) + O05 + OS, 20 25g ee 
Stratum 3 40,000 $Y_2 = -3Y_1 + e$ $e \sim N(0, 30)$
Y, = =3), =L35Y, +e 
1, = 3) +351) 3g e 
¥, = =3Y, + i 1/ 3¥opei OYE te 
Stratum 4 10,000 $Y_2 = -2Y_1 + e$ $e \sim N(0, 20)$


Y, = -2¥,+7,4+1/4% 42
Yn = /2Y, Xo. alel 4Y4 Gee SiG ye


Figure 1 displays key results of the simulations. The left
panel displays the ratios of the simulated average of $T_d$
(and $T_s$) over the corresponding simulated variance of the
point estimate, $\mathrm{var}(\bar{q}_m)$ (or $\mathrm{var}(\bar{q}_M)$), for the
fifty estimands. The median ratios are close to one in all
scenarios, and the averages of $T_d$ (and $T_s$) never differ by
more than 10% from their actual variances. Thus, both $T_d$
and $T_s$ appear to be approximately valid variance estimators.

The middle panel of Figure 1 summarizes the percentages
of the 5,000 synthetic 95% confidence intervals based
on $T_d$ (and on $T_s$) that cover their corresponding $Q$. The
coverage rates are close to 95% except for the regression
coefficients for the same subsampling approach with 100%
synthesis. For these coefficients, $T_s < 0$ in up to 38% of
the simulation runs, so that confidence intervals are based
on the conservative $T_s^*$. The highest fraction of negative
variances occurs in the smallest stratum, which has a
sampling rate of 50%. All variance estimates are positive
when only 10% of the records are synthesized.

The right panel of Figure 1 displays the ratios of the
simulated root mean squared error (RMSE) of the point
estimate ($\bar{q}_m$ or $\bar{q}_M$) over the
simulated RMSE from the subsamples without any synthesis.
For the same subsampling approach, the RMSEs of the
synthetic subsamples tend to be smaller than the RMSEs
based on the subsamples without any synthesis, particularly
for the 100% synthesis. The smaller RMSEs result because
the synthesis models are determined with $D$, i.e., the survey
data before taking the subsample, so that they carry additional
information that is not in the subsamples without
synthesis. For the different synthetic subsamples, the RMSE
ratios typically exceed one. Here, increased synthesis leads
to greater loss in efficiency. We note that the RMSEs from
the different sample and same sample approaches in Figure
1 are not directly comparable because they are based on
different denominators.

To enable comparisons across the methods, as well as to 
illustrate the losses in efficiency from subsampling, we 
repeat the simulation design using $m = 25$ for the independent
subsamples approach and $mr = 25$ for the same
subsamples approach. The left panel of Figure 2 displays the 
simulated RMSE ratios for the fifty estimands in the 
different scenarios, where the denominators are the average 
RMSEs based on the original data before any confidentiality 
protection. The right panel of Figure 2 displays the ratios of 
simulated average lengths of the 95% confidence intervals, 
where the denominators are the average lengths based on 
the original data before any confidentiality protection. Based 
on the left panel, for a given total number of released data- 
sets and given synthesis percentage, the independent sample 
approach results in more efficient estimates than the same 
sample approach. The right panel tells a similar story, 
although it is harder to see because of the scaling. Here, the 
same sample approach with 100% synthesis results in high 
fractions of negative variance estimates, so that the adjusted
variance $T_s^*$ is often used, thereby inflating the interval
lengths. Figure 2 also includes results from synthesis without
any subsampling, which generally provides more efficient
estimates than either subsampling approach.



Figure 1 Simulation results for the stratified sampling design. In the labels, s and d indicate the same subsample and the different
subsamples approach. The numbers indicate the percentage of records that are being synthesized. The denominators of
the RMSE are based on the point estimates from the subsamples without synthesis. For the different subsamples
approach, the RMSE is computed from the average of the m point estimates. Each box plot comprises fifty estimands.

Figure 2 Efficiency comparisons for the stratified sampling design. In the labels, org and syn indicate the original sample and the
synthetic sample before subsampling; and, s, d, and the numbers are as in Figure 1. The denominators of the RMSE are
based on the point estimates from the original sample without synthesis. Each box plot comprises fifty estimands.


4. Concluding remarks 


The different subsamples and same subsamples ap- 
proaches have competing advantages. For a fixed number of 
released datasets M, the different subsamples approach 
enables estimation with greater efficiency than the same 
subsamples approach - as evident in Figure 2 - since the 
released subsamples are independent rather than correlated. 
The different subsamples approach also guarantees positive 
variance estimates; the same subsample approach does not. 
However, with large M the different subsamples approach 
weakens the confidentiality protections of subsampling, 
since the combined datasets are likely to contain most of the 
records from the original survey. Hence, unless the sub- 
sampling rate is small (e.g., 1% or 2%), the NSI may have 
to make m modest (e.g., m =5) to use the different sub- 
samples approach. Because of this, the different samples 
approach is not viable when the original sample size is 
modest. 

As an alternative to subsampling with synthesis, agencies 
could release partially synthetic data that include all records 
from the original sample, assuming that they are willing to 
release files of that size. Partial synthesis on the original data 
generally engenders estimates with lower variances than 
subsampling with synthesis - as evident in Figure 2 - since 
more records are released. However, partial synthesis on the 
original data generally engenders higher disclosure risks 
than subsampling with synthesis, since more at risk records 
are in the released data and since the additional protection 
from subsampling is absent. Agencies can compare the two 
options on disclosure risks using the methods of Drechsler 
and Reiter (2008), which account for the protection afforded
by sampling, and on data utility by comparing inferences for
representative analyses. 

It is also possible that the process of subsampling may 
engender sufficient additional protection to enable lesser 
amounts of synthesis than would be necessary in a partial 
synthesis of the entire original dataset. Evaluating the data 
utility for subsampling with synthesis versus synthesis only 
for given disclosure risks is beyond the scope of this short 
note, but it is an interesting area for future research. 

We have not developed subsampling with synthesis ap- 
proaches for sampling designs other than (stratified) simple 
random samples. For the different subsamples approach, 
appropriate inferential methods require an approximately 
unbiased estimate of the variance from the first phase of 
sampling that can be computed from the subsample alone. 
This is elusive for complicated designs. For the same 
subsample approach, we conjecture that analysts can use the 
inferential methods presented in Section 2.2, provided that 
$\bar{u}_M$ appropriately accounts for the two phases of sampling.
We note that the formulas for $\bar{w}_M$ and $b_M$ remain the same
for other designs. Evaluating this conjecture is a subject of 
future research.


A hierarchical Bayesian nonresponse model
for two-way categorical data from small areas
with uncertainty about ignorability


Balgobin Nandram and Myron Katzoff


Abstract 


We study the problem of nonignorable nonresponse in a two dimensional contingency table which can be constructed for 
each of several small areas when there is both item and unit nonresponse. In general, the provision for both types of 
nonresponse with small areas introduces significant additional complexity in the estimation of model parameters. For this 
paper, we conceptualize the full data array for each area to consist of a table for complete data and three supplemental tables 
for missing row data, missing column data, and missing row and column data. For nonignorable nonresponse, the total cell 
probabilities are allowed to vary by area, cell and these three types of “missingness”. The underlying cell probabilities (i.e., 
those which would apply if full classification were always possible) for each area are generated from a common distribution 
and their similarity across the areas is parametrically quantified. Our approach is an extension of the selection approach for 
nonignorable nonresponse investigated by Nandram and Choi (2002a, b) for binary data; this extension creates additional 
complexity because of the multivariate nature of the data coupled with the small area structure. As in that earlier work, the 
extension is an expansion model centered on an ignorable nonresponse model so that the total cell probability is dependent 
upon which of the categories is the response. Our investigation employs hierarchical Bayesian models and Markov chain 
Monte Carlo methods for posterior inference. The models and methods are illustrated with data from the third National
Health and Nutrition Examination Survey.


Key Words: Metropolis-Hastings sampler; SIR algorithm; Nonignorable nonresponse model; Expansion model. 


1. Introduction 


In sample surveys, data are typically summarized in two-way
categorical tables. We consider the problem of nonignorable
nonresponse for many $r \times c$ categorical tables,
each obtained from a single area. For many of these surveys,
there are missing data and this gives rise to partial
classification of the sampled individuals. Thus, for each
two-way table there are both item nonresponse (one of the
two categories is missing) and unit nonresponse (both
categories are missing). One may not know how the data are
missing and a model that includes some difference between
the observed data and missing data (i.e., nonignorable
missing data) may be preferred. For a general $r \times c$
categorical table, we address the issue of estimation of the
cell probabilities of the two-way tables when there is
possibly nonignorable nonresponse but there is really no
information about ignorability. In such a situation, we
would like to express a degree of uncertainty about
ignorability. Nandram and Choi (2002a, b) have described
an expansion model appropriate for binary data when there
are data from many small areas. We will extend this work to
$r \times c$ categorical tables.

Letting $x$ denote the covariates and $y$ the response
variable, Little and Rubin (2002) describe three types of
missing-data mechanism. These types differ according to
whether the probability of response (a) is independent of $x$
and $y$; (b) depends on $x$ but not on $y$; or (c) depends on
$y$ and possibly $x$. The data are missing completely
at random (MCAR) in (a), missing at random (MAR) in (b),
and missing not at random (MNAR) in (c).
Models for MCAR and MAR missing-data mechanisms are
called ignorable if the parameters of the dependent variable
and the response variable are distinct (Rubin 1976). Models
for MNAR missing-data mechanisms are called nonignorable.
The general difficulty with nonignorable nonresponse
models is that the parameters are not identifiable
[e.g., see Nandram and Choi (2004, 2005, 2008, 2010) and
Nandram, Han and Choi (2002)].
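The practical difference between the three mechanisms can be seen in a small simulation. The Python sketch below is purely illustrative: the response probabilities, the binary $x$ and $y$, and the function names are assumptions chosen for the example, not quantities from any survey.

```python
import random


def response_probability(x, y, mechanism):
    """Illustrative response probabilities for binary x and y."""
    if mechanism == "MCAR":
        return 0.7                     # depends on neither x nor y
    if mechanism == "MAR":
        return 0.9 if x == 1 else 0.5  # depends on x only
    if mechanism == "MNAR":
        return 0.9 if y == 1 else 0.5  # depends on y itself
    raise ValueError(mechanism)


def respondent_mean(mechanism, n=20000, seed=0):
    """Mean of y among respondents; the true mean of y is 0.3."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        y = int(rng.random() < 0.3)    # y is generated independently of x
        if rng.random() < response_probability(x, y, mechanism):
            observed.append(y)
    return sum(observed) / len(observed)
```

In this setup the respondent mean stays near 0.3 under MCAR and MAR, but it is pulled upward under MNAR because individuals with $y = 1$ respond more often.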

For an $r \times c$ categorical table, let $I_{ijk\ell} = 1$ if the $\ell$th
individual within the $i$th area falls in the $j$th row and $k$th
column and 0 otherwise. Also, let $J_{i\ell} = 1$ if the $\ell$th
individual within the $i$th area has complete information and
0 otherwise. Finally, let
$P(J_{i\ell} = 1 \mid I_{ijk\ell} = 1, I_{ij'k'\ell} = 0, j' \neq j, k' \neq k) = \pi_{ijk}$.
For unit nonresponse, if $\pi_{ijk} = \pi_i$, the model
is ignorable; for item nonresponse, if the columns are
missing, row is observed and $\pi_{ijk} = \pi_{ij}$ (or $\pi_{ijk} = \pi_i$), the
model is ignorable; and if the rows are missing but columns
are observed and $\pi_{ijk} = \pi_{ik}$ (or $\pi_{ijk} = \pi_i$), the model is
ignorable. All other models are nonignorable; see Rubin
(1976) for further explanation.

Nandram and Choi (2002a, b) use an expansion model to 
study nonignorable nonresponse binary data. The expansion
model, a nonignorable nonresponse model, degenerates into
an ignorable nonresponse model (in the spirit of Draper 
1995) when a centering parameter is set to unity. This 
permits an expression of uncertainty about ignorability; see 
also Forster and Smith (1998). 

We discuss the model of Nandram and Choi (2002a, b)
for binary data from small areas. Let $J_{i\ell}$ denote the
response indicators and $I_{i\ell}$ denote the binary response.
Specifically, introducing the centering parameters $\gamma_i$ for
area $i$ to incorporate uncertainty about ignorability, the
model of Nandram and Choi (2002a, b) is

$$I_{i\ell} \mid p_i \overset{iid}{\sim} \text{Bernoulli}(p_i),$$

$$J_{i\ell} \mid \{\pi_i, I_{i\ell} = 0\} \overset{iid}{\sim} \text{Bernoulli}(\pi_i),$$

$$J_{i\ell} \mid \{\pi_i, \gamma_i, I_{i\ell} = 1\} \overset{iid}{\sim} \text{Bernoulli}(\gamma_i \pi_i), \quad 0 < \gamma_i \pi_i < 1.$$


When $\gamma_i = 1$, the nonignorable nonresponse model
degenerates to an ignorable nonresponse model. Here $\gamma_i$ is
the ratio of the odds of success among respondents to the
odds of success among all individuals for the $i$th area. The
parameter $\gamma_i$ describes the extent of nonignorability of the
response mechanism for area $i$, and it is through the $\gamma_i$ that
uncertainty about ignorability is incorporated. Nandram and
Choi (2002a, b) define $\delta_i = \pi_i\{\gamma_i p_i + (1 - p_i)\}$ to be the
probability that an individual responds in area $i$ in the entire
population, and with a belief that all the areas are similar
they take $(p_i, \delta_i, \gamma_i)$ to have a common distribution.
A priori they take beta distributions for $p_i$ and $\pi_i$,
respectively.
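The three Bernoulli statements of the model determine the joint probabilities of the binary response and the response indicator, and hence the overall response probability. The following Python sketch, with hypothetical function names and made-up parameter values, computes them for one area and checks the odds-ratio interpretation of the centering parameter.

```python
def joint_probabilities(p, pi, gamma):
    """P(I = u, J = v) for one area, keyed by (u, v)."""
    if not (0 < p < 1 and 0 < pi < 1 and 0 < gamma * pi < 1):
        raise ValueError("need 0 < p, pi < 1 and 0 < gamma * pi < 1")
    return {
        (0, 0): (1 - p) * (1 - pi),     # failure, nonrespondent
        (0, 1): (1 - p) * pi,           # failure, respondent
        (1, 0): p * (1 - gamma * pi),   # success, nonrespondent
        (1, 1): p * gamma * pi,         # success, respondent
    }


def response_rate(p, pi, gamma):
    """Overall probability of responding: pi * {gamma * p + (1 - p)}."""
    return pi * (gamma * p + (1 - p))


probs = joint_probabilities(p=0.3, pi=0.6, gamma=1.2)
delta = response_rate(p=0.3, pi=0.6, gamma=1.2)
odds_respondents = probs[(1, 1)] / probs[(0, 1)]  # odds of success, respondents
odds_all = 0.3 / 0.7                              # odds of success, everyone
```

In this example the ratio of the two odds recovers the value of the centering parameter, 1.2, and the response rate equals the sum of the two respondent cells.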

Here, the parameters are not identifiable. However, if
$\gamma_i = 1$, then all the parameters are identifiable. That is,
identifiability of the parameters depends on the $\gamma_i$. Note that
when $\gamma_i = 1$, we get an ignorable model for a MAR
mechanism. As the parameters are identifiable in this model,
it is quite sensible to use this model (or similar models) as a
baseline model. However, note that this model is still not
justified because it assumes that missing data are like
observed data. Thus, to add flexibility to this ignorable
nonresponse model, we use the $\gamma_i$.

Let $y_{iuv}$ be the number of individuals with $I_{i\ell} = u$,
$J_{i\ell} = v$, $u, v = 0, 1$, in the $i$th area. Then, under the model,

$$(y_{i00}, y_{i01}, y_{i10}, y_{i11}) \mid \pi_i, p_i, \gamma_i \overset{ind}{\sim} \text{Multinomial}\{n_i,\ (1 - p_i)(1 - \pi_i),\ (1 - p_i)\pi_i,\ p_i(1 - \gamma_i \pi_i),\ p_i \gamma_i \pi_i\},$$

with independence over areas. Here, only $y_{i01}$ and $y_{i11}$ are
observed, and therefore all parameters are nonidentifiable if
the $\gamma_i$ are unknown. We obtain the likelihood function in a
similar manner for the more complete $r \times c$ categorical
table with missing data.

We start with a gamma distribution, and to permit
centering on the ignorable nonresponse model, we must take
each $\gamma_i$ to have mean 1. However, we need to use a
truncated gamma distribution because $0 < \pi_i < 1$ and
$0 < \gamma_i < 1/\pi_i$. An interesting idea of Nandram and Choi
(2002a, b) is to model the centering as a truncated gamma

$$\gamma_i \mid \nu \overset{iid}{\sim} \text{Gamma}(\nu, \nu), \quad 0 < \gamma_i < 1/\pi_i, \quad 0 < \pi_i < 1.$$

The model is complete with noninformative prior densities
on all hyperparameters. One can use alternative distributions
(e.g., a truncated lognormal density) for the $\gamma_i$, but this is
not a key issue and it would not matter much.
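A draw from a gamma distribution with unit mean, truncated so that the product of the centering parameter and the response probability stays below one, can be sketched with simple rejection sampling. The function name and parameter values below are hypothetical, and rejection sampling is only one convenient choice for illustration, not necessarily the sampler used in the cited work.

```python
import random


def draw_centering_parameter(nu, pi, rng):
    """Draw from Gamma(shape=nu, rate=nu) truncated to (0, 1/pi)."""
    upper = 1.0 / pi
    while True:
        # random.gammavariate takes (shape, scale); scale 1/nu gives mean 1.
        g = rng.gammavariate(nu, 1.0 / nu)
        if g < upper:
            return g


rng = random.Random(3)
draws = [draw_centering_parameter(nu=5.0, pi=0.5, rng=rng) for _ in range(2000)]
```

Larger values of the shape parameter concentrate the draws around one, which corresponds to stronger prior belief in the ignorable model; truncation pulls the mean slightly below one when the response probability is large.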

One can use an area level model with random effects in 
which, conditional on the observed data, the nonresponse is 
dependent upon area-level random effects. This can be 
formulated using a logit link function, but we have not 
developed our models in this direction partly because we are 
not using covariates here; see Nandram and Choi (2010) for 
the use of covariates and random effects. 

The approach in Nandram and Choi (2002a, b) is
attractive, but it does not apply immediately to the current
$r \times c$ categorical table problem. Specifically, only one
centering parameter per area is needed in Nandram and 
Choi (2002a, b). In our formulation, one now needs rc 
centering parameters per area; each of these parameters has 
to have a distribution centered at one to allow degeneration 
to the ignorable nonresponse model. There are also 
inequality constraints that must be included in the non- 
ignorable nonresponse model. In addition, one cannot rule 
out the possibility that these parameters are correlated. The 
methodology needed to apply the work of Nandram and 
Choi (2002a, b) to the $r \times c$ categorical table is not
straightforward. Noting these difficulties Nandram, Liu, 
Choi and Cox (2005) (with a single supplemental table) and 
Nandram, Cox and Choi (2005) (with the three 
supplemental tables) use a simpler idea, but not quite as 
elegant as in Nandram and Choi (2002a, b), for centering; 
see also Nandram and Choi (2005). 

Essentially, Nandram, Cox and Choi (2005) and 
Nandram, Liu, Cox and Choi (2005) assume an ignorable 
model, obtain samples of the response probabilities and use 
these sampled response probabilities to fit the response 
probabilities of a nonignorable nonresponse model while 
“controlling” its parameters. Of course, a possible alter- 
native occurs when there is information about the degree of 
nonignorability. However, the problem of incorporating 
prior information about a systematic departure from 
ignorability is more complex for our problem, and it would 
need additional costly field work to obtain such information. 

We discuss our philosophy about the nonignorable
nonresponse problem, a fundamentally aliased problem. In
fact, this problem is extremely difficult and we believe that
there is really no solution to it, but we must try. Without any
information one cannot tell how respondents and nonrespondents
differ. An ignorable nonresponse model falls short
because it assumes that respondents and nonrespondents are
similar, but the respondents and nonrespondents may differ.
Statisticians must not confront imprecision (sampling error)
only, but they must be bold enough to study subjectivity
(ignorance arising from missing information). Unfortunately,
as is well known, nonignorable nonresponse models
have nonidentifiable parameters. We discuss how the key
nonignorability parameters are identified. We know that if
the respondents and nonrespondents are similar, then the $\gamma_i$
are equal to unity, and we get the ignorable nonresponse model
with all parameters identified. We can now expand the
ignorable nonresponse model into a nonignorable nonresponse
model by putting a distribution on these $\gamma_i$
centered at 1, still maintaining identifiability. One can
formulate a nonignorable nonresponse model to add
flexibility to the ignorable nonresponse model as we have
done in our work; the flexibility is a form of sensitivity
analysis, coherent in this case, and indeed it is a Bayesian
uncertainty (risk) assessment (e.g., Greenland 2009). This is
what we have been doing or trying to do in our work.

In this paper we attempt to solve the difficult problem of
Nandram and Choi (2002a, b) in its original form for $r \times c$
tables for many areas. The plan of this paper is as follows.
In Section 2, we describe the hierarchical Bayesian model. 
Specifically we describe the nonignorable nonresponse 
mechanism and we construct an appropriate prior distri- 
bution. In Section 3, we show how to fit the model using the 
sampling importance resampling (SIR) algorithm to 
subsample from an approximate posterior density after an 
innovative collapsing of the complete joint posterior density. 
In Section 4, we illustrate our methodology with public-use 
data from thirteen states in the third National Health and 
Nutrition Examination Survey (NHANES III). Section 5 has 
concluding remarks. 


2. The nonignorable nonresponse model 


For the problem of nonresponse in a two-dimensional 
table, we can have both item and unit nonresponse. Thus, 
one may consider the full data array to consist of four tables: 
one for complete data and three supplemental tables - one 
for missing row information, one for missing column 
information and a table for which neither row nor column 
membership has been recorded. Throughout this paper, we
index rows by $j = 1, \ldots, r$; columns by $k = 1, \ldots, c$; and
the four tables by $s = 1, 2, 3, 4$. We index areas by
$i = 1, 2, \ldots, A$ and individuals within areas by $l = 1, 2, \ldots, n_i$.
We next describe the nonignorable nonresponse model
(i.e., the expansion model).


2.1 Sampling process


We adapt the terminology and definitions used in
Nandram, Cox and Choi (2005) to our situation. For sample
individual $l$ in area $i$, let

$$I_{iljk} = \begin{cases} 1, & \text{if the outcome category is } (j, k), \\ 0, & \text{otherwise,} \end{cases}$$

and let $J_{il}$ denote one of the 4-tuples (1, 0, 0, 0),
(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1). We assume that

$$I_{il} \overset{\text{def}}{=} \{I_{iljk},\ j = 1, \ldots, r;\ k = 1, \ldots, c\} \mid p_i \overset{\text{iid}}{\sim} \text{Mult}\{1, p_i\} \tag{1}$$

and

$$J_{il} \mid \{I_{iljk} = 1,\ I_{ilj'k'} = 0 \text{ for all } j' \neq j \text{ and } k' \neq k,\ \pi_{ijk}\} \overset{\text{iid}}{\sim} \text{Mult}\{1, \pi_{ijk}\}, \tag{2}$$

where $p_i \overset{\text{def}}{=} \text{vec}(\{p_{ijk},\ j = 1, 2, \ldots, r;\ k = 1, 2, \ldots, c\})$ is a
vector of probabilities for the table of $rc$ categories for the
variable of observation which must sum to one and, for cell
$(j, k)$ in that two-dimensional table,

$$\pi_{ijk} \overset{\text{def}}{=} \text{vec}(\{\pi_{isjk},\ s = 1, 2, 3, 4\})$$

is a vector of probabilities which must sum to one.

Next, we define cell counts $y_{isjk}$ for each table
$s = 1, \ldots, 4$ for area $i$ such that, for cell $(j, k)$,

$$(y_{i1jk}, y_{i2jk}, y_{i3jk}, y_{i4jk}) = \sum_{l=1}^{n_i} I_{iljk} J_{il},$$

where the $y_{i1jk}$ are observed and the $y_{isjk}$, for $s = 2, 3, 4$, are
latent variables which satisfy the observed constraints
$\sum_{k} y_{i2jk} = u_{ij}$, $\sum_{j} y_{i3jk} = v_{ik}$ and $\sum_{j,k} y_{i4jk} = w_i$. All
inferences will be conditional on the observed quantities
$u_{ij}$, $v_{ik}$ and $w_i$. But see Nandram (2009) for the analysis of
a single $r \times c$ table under nonresponse when the margins
are also random. We will denote the vector of the $y_{i1jk}$ by
$y_{(1)}$, the vector of the $y_{isjk}$, $s = 2, 3, 4$, by $y_{(2)}$, and the
complete vector by $y = (y_{(1)}, y_{(2)})$.

The parameters $\pi_{isjk}$ are not identifiable. If the
distributions of these parameters are known completely,
then the nonidentifiability will disappear. Thus, the key
issue is how to identify these parameters. We know that if
the respondents and nonrespondents are similar (i.e., the
four patterns, complete and partially complete tables), then
we can take $\pi_{isjk} = \pi_{is}$; this is the ignorable nonresponse
model. The $\pi_{is}$ can be estimated by the proportions of cases
falling in the four tables for each area. This is a natural point
to start. To expand the ignorable nonresponse model into a
nonignorable model, and still maintain identifiability, first
we need a simplification. We take $\pi_{isjk} = \psi_{ijk} \pi_{is}$, which
gives a nonignorable nonresponse model in which the
parameters $\psi_{ijk}$ are not identifiable.

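To center the nonignorable model on the ignorable
model, we take

$$\pi_{isjk} = \begin{cases} \psi_{i1jk}\, \pi_{is}, & \text{for } s = 1, \\ \psi_{ijk}\, \pi_{is}, & \text{for } s = 2, 3, 4, \end{cases} \tag{3}$$

and require that $\sum_{s=1}^{4} \pi_{isjk} = 1$. A little algebra then yields the
relationship

$$\psi_{i1jk} = \frac{1 - \psi_{ijk}(1 - \pi_{i1})}{\pi_{i1}} = a_{ijk}(\pi_{i1}, \psi_{ijk})\, \psi_{ijk}, \tag{4}$$

where $a_{ijk}(\pi_{i1}, \psi_{ijk}) = \{1 + (\psi_{ijk}^{-1} - 1)\, \pi_{i1}^{-1}\}$, from
which it is clear that $\psi_{i1jk} = 1$ if, and only if, $\psi_{ijk} = 1$. Note
that since $0 < \pi_{isjk} < 1$ and $(1 - \pi_{i1})^{-1} < \min\{\pi_{is}^{-1},\ s = 2, 3, 4\}$,
it follows that $0 < \psi_{ijk} < (1 - \pi_{i1})^{-1}$.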
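
The centering in (3) and (4) can be checked numerically. The following is a minimal sketch, not the authors' code: the function name and the illustrative probabilities are assumptions, and it treats a single cell $(j, k)$ of a single area.

```python
def centered_response_probs(pi, psi):
    """Return (pi_i1jk, ..., pi_i4jk) for one cell, centered on the ignorable model.

    pi  : the four table probabilities (pi[0] is the complete-data table), summing to 1
    psi : the nonignorability parameter of the cell, with 0 < psi < 1 / (1 - pi[0])
    """
    pi1 = pi[0]
    upper = 1.0 / (1.0 - pi1)
    if not 0.0 < psi < upper:
        raise ValueError("psi must lie in (0, 1/(1 - pi_i1))")
    # Relationship (4): the multiplier for the complete table is determined by psi.
    psi1 = (1.0 - psi * (1.0 - pi1)) / pi1
    return [psi1 * pi1] + [psi * p for p in pi[1:]]


pi = [0.6, 0.1, 0.25, 0.05]                    # illustrative values only
ignorable = centered_response_probs(pi, 1.0)   # psi = 1 recovers pi itself
shifted = centered_response_probs(pi, 1.5)     # psi > 1 moves mass to tables 2-4
```

With $\psi_{ijk} = 1$ the four probabilities equal the $\pi_{is}$, and for any admissible $\psi_{ijk}$ they still sum to one, which is the sense in which the expansion model is centered on the ignorable model.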

By combining (1) and (2) and noting the definition of
$\pi_{isjk}$ in (3), similar to the binary case, we get a multinomial
distribution for $y$ conditional on $\pi$, $\psi$, $p$ and the likelihood
function for the sample can now be seen to be


$f(y \mid \pi, \psi, p)$

Collecting factors which are powers of $\pi_{is}$, the likelihood
function may also be expressed as


where $0 < \pi_{is} < 1$, $\sum_{s=1}^{4} \pi_{is} = 1$ and $0 < \psi_{ijk} < (1 - \pi_{i1})^{-1}$. Here
we note that y,, and y,, are observed variables but the 
Y;,;, are latent variables. 


2.2 Prior construction 


The following assumptions describe the prior distribu- 
tions for the nonignorable nonresponse model: 


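1. For the vector of cell probabilities $p_i$, we assume that

$$p_i \mid \mu_1, \tau_1 \overset{\text{iid}}{\sim} \text{Dirichlet}(\mu_1 \tau_1),$$

where $\mu_1 = (\mu_{111}, \mu_{112}, \ldots, \mu_{11c}, \ldots, \mu_{1r1}, \ldots, \mu_{1rc})'$, $\mu_{1jk} \geq 0$,
and $\sum_{j=1}^{r} \sum_{k=1}^{c} \mu_{1jk} = 1$. The parameter $\tau_1$ informs us of
similarity among the $p_i$: the larger $\tau_1$, the more alike
the $p_i$. This is true because a large $\tau_1$ means that the
variances of the $p_i$ are small and, because they have the
same mean, they are more similar with larger $\tau_1$.

Thus, the density for $p$ is

$$g_1(p \mid \mu_1, \tau_1) = \prod_{i=1}^{A} g_{1i}(p_i \mid \mu_1, \tau_1) = \prod_{i=1}^{A} \frac{\prod_{j=1}^{r} \prod_{k=1}^{c} p_{ijk}^{\mu_{1jk} \tau_1 - 1}}{D(\mu_1 \tau_1)}, \tag{7}$$

where, for a $k$-tuple $c$ and a scalar $t$,

$$D(ct) = \frac{\prod_{j=1}^{k} \Gamma(c_j t)}{\Gamma(t)}$$

for $c_j > 0$ and $\sum_{j=1}^{k} c_j = 1$.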


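2. Independently of the $p_i$, the $\pi_i = (\pi_{i1}, \pi_{i2}, \pi_{i3}, \pi_{i4})'$
follow the specification

$$\pi_i \overset{\text{iid}}{\sim} \text{Dirichlet}(\mu_2 \tau_2),$$

with $\pi_{is} \geq 0$ and $\sum_{s=1}^{4} \pi_{is} = 1$, where $\mu_2 = (\mu_{21}, \mu_{22}, \mu_{23}, \mu_{24})'$,
$\mu_{2s} \geq 0$, $\sum_{s=1}^{4} \mu_{2s} = 1$ and $\tau_2$ is a measure of
similarity among the $\pi_i$. Thus, the density for $\pi_i$ is

$$g_{2i}(\pi_i \mid \mu_2, \tau_2) = \frac{\prod_{s=1}^{4} \pi_{is}^{\mu_{2s} \tau_2 - 1}}{D(\mu_2 \tau_2)}. \tag{8}$$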
3. For each $i$, let $\psi_i = (\psi_{i11}, \ldots, \psi_{i1c}, \ldots, \psi_{ir1}, \ldots, \psi_{irc})'$
so that $\psi = (\psi_1', \ldots, \psi_A')'$. We assume for each $i$
that the $\psi_{ijk}$ are independently and identically
distributed in accordance with a distribution derived
from the Gamma($\beta$, $\beta$), where the support is confined
to the open interval $(0, (1 - \pi_{i1})^{-1})$; in other words, the
ordinary gamma distribution is truncated as

$$\psi_{ijk} \mid \beta, \pi_i \overset{\text{ind}}{\sim} \text{Gamma}(\beta, \beta) \text{ such that } 0 < \psi_{ijk} < (1 - \pi_{i1})^{-1}.$$

It is worth noting that these $\psi_{ijk}$ are identically
distributed over $j$ and $k$. Again, one can use other
distributions such as a truncated lognormal density, but
this will make little difference. In this formulation, there
is some information about $\beta$ because the small areas are
assumed to share a common effect.

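Thus, for area $i$, the density for $\psi_i$ is

$$g_{3i}(\psi_i \mid \beta, \pi_i) = \prod_{j=1}^{r} \prod_{k=1}^{c} \frac{\psi_{ijk}^{\beta - 1} e^{-\beta \psi_{ijk}}}{\int_0^{(1 - \pi_{i1})^{-1}} \psi^{\beta - 1} e^{-\beta \psi}\, d\psi}$$

for $0 < \psi_{ijk} < (1 - \pi_{i1})^{-1}$. Making the transformation
$t_{ijk} = \beta \psi_{ijk}$, one can see that the normalizing constant in
the denominator of each of the factors in $g_{3i}(\psi_i \mid \beta, \pi_i)$
is $G_\beta[\beta (1 - \pi_{i1})^{-1}]$, where $G_\beta(\cdot)$ is the gamma function
with scale parameter $\beta$. To eliminate the dependence of
the range of integration on $\pi_i$, let $\phi_{ijk} = (1 - \pi_{i1}) \psi_{ijk}$
and let $\phi_i = (\phi_{i11}, \ldots, \phi_{i1c}, \ldots, \phi_{ir1}, \ldots, \phi_{irc})'$. Then

$$g_{3i}(\phi_i \mid \beta, \pi_i) = \prod_{j=1}^{r} \prod_{k=1}^{c} \frac{\beta^{\beta} \phi_{ijk}^{\beta - 1} e^{-\beta \phi_{ijk} / (1 - \pi_{i1})}}{(1 - \pi_{i1})^{\beta}\, \Gamma(\beta)\, G_\beta[\beta (1 - \pi_{i1})^{-1}]}$$

for $0 < \phi_{ijk} < 1$. The joint prior for $\pi_i$ and $\phi_i$ is just the
product of $g_{3i}(\phi_i \mid \beta, \pi_i)$ and $g_{2i}(\pi_i \mid \mu_2, \tau_2)$. Thus, the
joint prior for $\phi = (\phi_1', \ldots, \phi_A')'$ and $\pi$ is

$$g(\pi, \phi \mid \mu_2, \tau_2, \beta) \overset{\text{def}}{=} \prod_{i=1}^{A} \{g_{3i}(\phi_i \mid \beta, \pi_i) \times g_{2i}(\pi_i \mid \mu_2, \tau_2)\}.$$

That is,

$$g(\pi, \phi \mid \mu_2, \tau_2, \beta) = \prod_{i=1}^{A} \left\{ \frac{\prod_{s=1}^{4} \pi_{is}^{\mu_{2s} \tau_2 - 1}}{D(\mu_2 \tau_2)} \prod_{j=1}^{r} \prod_{k=1}^{c} \frac{\beta^{\beta} \phi_{ijk}^{\beta - 1} e^{-\beta \phi_{ijk} / (1 - \pi_{i1})}}{(1 - \pi_{i1})^{\beta}\, \Gamma(\beta)\, G_\beta[\beta (1 - \pi_{i1})^{-1}]} \right\}. \tag{10}$$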
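
To make the truncated gamma prior concrete, here is a minimal sketch of drawing one $\psi_{ijk}$ and the rescaled $\phi_{ijk} = (1 - \pi_{i1}) \psi_{ijk}$. It uses simple rejection from the untruncated Gamma($\beta$, $\beta$), which is a choice made here for illustration and not necessarily how the authors sample; the function names and parameter values are hypothetical.

```python
import random


def draw_psi(beta, pi1, rng):
    """Draw psi from Gamma(shape=beta, rate=beta) truncated to (0, 1/(1 - pi1))."""
    upper = 1.0 / (1.0 - pi1)
    while True:
        # random.gammavariate takes (shape, scale); rate beta means scale 1/beta,
        # so the untruncated distribution has mean 1.
        x = rng.gammavariate(beta, 1.0 / beta)
        if 0.0 < x < upper:
            return x


def draw_phi(beta, pi1, rng):
    """Rescale psi so that the support (0, 1) no longer depends on pi1."""
    return (1.0 - pi1) * draw_psi(beta, pi1, rng)


rng = random.Random(2012)
psi_draws = [draw_psi(1.5, 0.5, rng) for _ in range(5)]
phi_draws = [draw_phi(1.5, 0.5, rng) for _ in range(5)]
```

Because the upper bound $(1 - \pi_{i1})^{-1}$ always exceeds one and the untruncated mean is one, the rejection loop accepts a sizeable share of proposals, so this simple scheme is adequate for illustration.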


The description of the model is completed by specifying
the assumptions on the hyperparameters. As there are no
conjugate priors, we use shrinkage priors for $\tau_1$, $\tau_2$ and $\beta$
because these are proper and noninformative. Priors of the
form $p(\tau_1) \propto 1/\tau_1$, and specifically proper diffuse
gamma priors, are discouraged; see, for example, Gelman
(2006). Other alternatives are half-Cauchy densities and
gamma densities (one would need to specify the
hyperparameters). Thus, we take


1. $\tau_1$, $\tau_2$ and $\beta$ have independent shrinkage priors of the
form

$$f(x) = \frac{a_0}{(a_0 + x)^2} \text{ for } x > 0,$$

where $a_0$ is specified; it is standard practice to take
$a_0 = 1$.


2. We also assume that $\mu_1 \sim \text{Dirichlet}(1, 1, \ldots, 1)$ and
$\mu_2 \sim \text{Dirichlet}(1, 1, 1, 1)$.
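
The shrinkage prior $f(x) = a_0 / (a_0 + x)^2$ has the closed-form distribution function $F(x) = x / (a_0 + x)$, so it can be sampled by inversion. The sketch below illustrates this; the function names are hypothetical and this is not taken from the authors' implementation.

```python
import random


def shrinkage_cdf(x, a0=1.0):
    """F(x) = x / (a0 + x), the integral of a0 / (a0 + t)^2 from 0 to x."""
    return x / (a0 + x)


def shrinkage_quantile(u, a0=1.0):
    """Inverse of F: solve u = x / (a0 + x) for x, with 0 <= u < 1."""
    return a0 * u / (1.0 - u)


def draw_shrinkage(rng, a0=1.0):
    """One draw by inversion; rng.random() lies in [0, 1), so no division by zero."""
    return shrinkage_quantile(rng.random(), a0)


rng = random.Random(0)
tau_draws = [draw_shrinkage(rng) for _ in range(5)]
```

The median of this prior is $a_0$, which is one way to see how the choice $a_0 = 1$ anchors the prior while leaving a heavy right tail.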


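Let $\Omega = (\beta, \mu_1, \tau_1, \mu_2, \tau_2)$. Then the density for $\Omega$ is

$$p(\Omega) \propto \frac{1}{(a_0 + \tau_1)^2 (b_0 + \tau_2)^2 (c_0 + \beta)^2} \tag{11}$$

for $\tau_1$, $\tau_2$ and $\beta \geq 0$, $\sum_{j=1}^{r} \sum_{k=1}^{c} \mu_{1jk} = 1$ and $\sum_{s=1}^{4} \mu_{2s} = 1$.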
By Bayes' theorem, the joint posterior density is

$$h(\Omega, p, \pi, \phi, y_{(2)} \mid y_{(1)}, u, v, w) \propto f(y \mid \pi, \phi, p)\, g_1(p \mid \mu_1, \tau_1)\, g(\pi, \phi \mid \mu_2, \tau_2, \beta)\, p(\Omega),$$

where, substituting $(1 - \pi_{i1})^{-1} \phi_{ijk}$ for $\psi_{ijk}$,


Y « ec pel f Ty it +—{1 = Ty -t}} (12) 
ijk il 


To make inferences about the $p_{ijk}$, we will draw
samples from $h(\Omega, p, \pi, \phi, y_{(2)} \mid y_{(1)}, u, v, w)$ using Markov
chain Monte Carlo methods. This procedure is described in
Section 3.


3. Computations


We use the SIR algorithm to subsample a random sample
from an approximate posterior density. There are three steps
to accomplish this task. We collapse over the $p_i$, $\pi_i$ and $\phi_i$,
approximate the collapsed density by a simpler one and
sample from it, and then subsample these samples to get
samples from the original density. We show how to do these
three steps in this section.

To obtain the approximation and to simplify the
computations, in Appendix A we collapse over the $p_i$, $\pi_i$
and $\phi_i$ to get


A 
Tt, (Q Yay | y,uvw) [YT 


| 


AQ, Yay | Y,,U,V,W) = 


where

with $b_i = \min\{1/\theta_{ijk},\ j = 1, \ldots, r;\ k = 1, \ldots, c\}$ and
$\pi_a(\Omega, y_{(2)} \mid y_{(1)}, u, v, w)$

To evaluate $I_i$ for each $i = 1, \ldots, A$, we proceed as
follows given $(\Omega, y_{(2)})$:


1. Draw independent samples of vectors 2, and , from 
the Dirichlet(y’ +p,1,) and Dirichlet(y +8 f), 
respectively. For each m, and ,, draw a sample of 
values for W, from the truncated gamma distribution on 
the interval (0, {Bb,/l—7,,}) with parameter rc. 


2. For each $\pi_i$, $\theta_i$ and $W_i$ selected in step (1), compute
$R_1 R_2$, where

and


3. Repeat steps (1) and (2) 1,000 times. Then compute the
average of $R_1 R_2$ over these 1,000 values.


The rest of our computation has two parts. First, we use
the griddy Metropolis-Hastings sampler to draw from
$\pi_a(\Omega, y_{(2)} \mid y_{(1)}, u, v, w)$. We sample $\mu_1$, $\mu_2$, $\tau_1$ and $\tau_2$ from
their conditional posterior densities using grids; this entails
transforming $\tau_1$ and $\tau_2$ to the unit interval (0, 1). For each
distribution, 100 grids are used; see Nandram, Cox and Choi
(2005) for a similar procedure. Here $y_{(2)}$ is drawn by
sampling from its conditional probability mass function
componentwise. Draws are made from the conditional
posterior density of $\beta$ using a Metropolis step in a manner
similar to Nandram and Choi (2002a, b). We ran this
algorithm for 11,000 iterations and discarded the first 1,000
iterates as a "burn-in". We found that the autocorrelations
among the iterates were small, thereby indicating strong
mixing of the sampler. We have also used the batch-means
method to further assess the computation. We used batches
of 25 to compute numerical standard errors.
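
As an illustration of the batch-means method, the following sketch computes a numerical standard error from a sequence of draws. It assumes that "batches of 25" means 25 consecutive iterates per batch (the text could also be read as 25 batches), and the function name is hypothetical.

```python
import math


def batch_means_nse(draws, batch_size=25):
    """Numerical standard error of the mean of `draws` by the batch-means method.

    The chain is cut into consecutive batches of `batch_size` iterates; leftover
    iterates at the end are ignored. The NSE is the standard deviation of the
    batch means divided by the square root of the number of batches.
    """
    n_batches = len(draws) // batch_size
    if n_batches < 2:
        raise ValueError("need at least two full batches")
    means = [
        sum(draws[b * batch_size:(b + 1) * batch_size]) / batch_size
        for b in range(n_batches)
    ]
    grand = sum(means) / n_batches
    var_of_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var_of_means / n_batches)


# Two batches with means 0 and 2: the variance of the means is 2, so NSE = 1.
example_nse = batch_means_nse([0.0] * 25 + [2.0] * 25)
```

A small NSE relative to the posterior standard deviation suggests that repeating the run would give nearly the same posterior mean.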

Second, we use the SIR algorithm to subsample the
sample of 10,000 iterates we obtained from
$\pi_a(\Omega, y_{(2)} \mid y_{(1)}, u, v, w)$. For each of the 10,000 iterates we calculate
the weights

$$w_m = \frac{\pi(\Omega^{(m)}, y_{(2)}^{(m)} \mid y_{(1)}, u, v, w)}{\pi_a(\Omega^{(m)}, y_{(2)}^{(m)} \mid y_{(1)}, u, v, w)}, \quad m = 1, \ldots, M = 10{,}000, \tag{17}$$

and we resample $\{\Omega^{(m)}, y_{(2)}^{(m)}\}$ with probabilities
proportional to the weights $w_m$ for $m = 1, \ldots, M$ without
replacement. We use a 10% sampling fraction, subsampling the 10,000
iterates to get 1,000 iterates. Sampling without replacement
is a good idea because it avoids adding further repeated values;
some repeats already exist because the Metropolis-Hastings sampler,
unlike an accept-reject sampler, retains the current value whenever
a proposal is rejected. As usual with sampling without replacement,
the weights are recalculated every time a value is selected.
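
The resampling step can be sketched as follows. This is a minimal illustration of weighted sampling without replacement in which the selection probabilities are renormalized over the remaining iterates after each draw; the function name is hypothetical, and the weights are assumed to have been computed beforehand as in (17).

```python
import random


def sir_subsample(weights, k, rng):
    """Select k distinct indices with probability proportional to `weights`.

    After each selection the chosen index is removed, so the remaining
    probabilities are implicitly renormalized over what is left.
    """
    if k > len(weights):
        raise ValueError("cannot draw more values than are available")
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        total = sum(weights[i] for i in remaining)
        u = rng.random() * total
        acc = 0.0
        pick = len(remaining) - 1        # fallback guards against rounding error
        for pos, i in enumerate(remaining):
            acc += weights[i]
            if u < acc:
                pick = pos
                break
        chosen.append(remaining.pop(pick))
    return chosen


rng = random.Random(17)
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
subsample = sir_subsample(weights, 3, rng)   # indices of the retained iterates
```

This direct version costs on the order of $M \times k$ operations, which is manageable for $M = 10{,}000$ and $k = 1{,}000$; a faster data structure would be a refinement, not a change of method.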

Finally, we can now make exact (within the limitations of
Markov chain Monte Carlo) inference about the $p_i$
a posteriori. Let $y_{i \cdot jk} = \sum_{s=1}^{4} y_{isjk}$ and let $y_{i \cdot}$ denote the
vector of the $y_{i \cdot jk}$. Then,


Thus, for each value of $y_{(2)}$, $\mu_1$ and $\tau_1$ we obtain from the
SIR algorithm, we draw a value of $p_i$, $i = 1, \ldots, A$. In this way,
we obtain a Rao-Blackwellized density for each of the $p_i$,
and inference proceeds in the usual way.
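
The final draw of the cell probabilities can be sketched under one assumption: that the conditional posterior of $p_i$ is the conjugate Dirichlet whose parameters are the combined cell counts plus $\mu_1 \tau_1$, as the Dirichlet prior and multinomial sampling of Section 2 would suggest. The function names and the numbers below are hypothetical.

```python
import random


def draw_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_cell_probs(combined_counts, mu1, tau1, rng):
    """One draw of p_i given the combined counts for area i and (mu1, tau1).

    Assumes the conjugate form Dirichlet(combined_counts + mu1 * tau1).
    """
    alpha = [y + m * tau1 for y, m in zip(combined_counts, mu1)]
    return draw_dirichlet(alpha, rng)


rng = random.Random(5)
counts = [12, 7, 10, 5, 3, 4, 3, 2, 2]   # illustrative 3x3 table, row by row
mu1 = [1.0 / 9.0] * 9                    # illustrative prior mean
p_draw = draw_cell_probs(counts, mu1, 45.0, rng)
```

Repeating this draw for each retained SIR iterate gives a sample of $p_i$ values from which posterior means, standard deviations and credible intervals can be computed.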


4. An illustrative example 


Our illustrative example is in health statistics. In Section 
4.1 we briefly discuss the data we used from the third 
National Health and Nutrition Examination Survey 
(NHANES III). Specifically, we study the relationship
between bone mineral density (BMD) and family income (FI); see
Nandram, Cox and Choi (2005) for a discussion of this
problem. In Section 4.2, after briefly discussing our
computation, we present posterior inference on the cell
probabilities. In Section 4.3, using the Bayes factor, we
discuss the relation between BMD and FI.


4.1 NHANES III data 


The sample design is a stratified multistage probability 
design which is representative of the total civilian non- 
institutionalized population, 2 months of age or older, in the 
United States. Further details of the NHANES III sample 
design are available (National Center for Health Statistics 
1992, 1994). The NHANES III data collection consists of 
two parts: the first part is the sample selection and the 
interview of the members of a sampled household for their 
personal information, and the second part is the examination 
of those interviewed at the mobile examination center 
(MEC). The health examination has information on physical 
examination, tests and measurements performed by techni- 
cians, and specimen collection. The sample was selected 
from households in 81 primary units across the continental 
United States during the period from October 1988 through 
September 1994. The final data for this study are drawn from the
35 largest primary sampling units, each with a population of at least
500,000, and we consider 13 subnational areas.

Nonresponse occurs in the interview and examination
parts of the survey. The interview nonresponse arises from
sampled persons who did not respond to the interview.
Some of those who were already interviewed and included
in the subsample for a health examination missed the
examination at home or at the MEC, thereby missing all or
part of the examinations.

Doctors believe that obese and overweight individuals do 
not generally turn up at the MEC. Cohen and Duffy (2002) 
point out that “Health surveys are a good example, where it 
seems plausible that propensity to respond may be related to 
health.” NHANES III is a good example. 

Sampled persons in NHANES III can be categorized by 
many types of attributes, and researchers analyze such 
categorical tables for goodness of fit or independence. Here 
we study bone mineral density (BMD) and family income 
(FI). We note here that while FI is a discrete variable, we 
have classified BMD into three levels (normal, osteopenia 
and osteoporosis), and FI into three levels (low, medium and 
high). However, only partial classification of the individuals 
is available because some individuals are classified by only 
one attribute while others are not classified. About 62% of
the households have both FI and BMD observed, 8% have
only BMD observed, 29% have only income observed, and 1%
have neither income nor BMD observed, among those who participated in
the examination stage. Our problem is to estimate the cell
probabilities and to test for association between BMD and 
FI for each of 13 subnational areas using our expansion 
model that pools the data adaptively. 

In Table 1 we present the 3x3 tables of BMD and FI 
for the aforementioned 13 areas. Note that areas 6 and 48 
have enough data so that they can stand by themselves. 
However, the other areas are very small; the counts in the 
table with row totals are generally small except for area 17 
and the counts in the table with just total are small. Even for 
the table with complete data the cell counts are generally 
small forcing us to use small area estimation techniques to 
borrow strength. 


Column Total Row Total Total 
lit} 5 6 4 0 1 l 
178 54 82 65 28 4 20 
18 1] 16 5 6 B l 
18 10 16 17 2 2 4 
9 6 12 1] 4 5 | 
10 5 1] 4 3 0 | 
9 D, 9 0 2 4 | 
43 aN| 42 9 i 6 l 
9 i 5 D 3 0 0 
35) 15 24 3 l 0 0 
19 4 12 7 | 0 | 
88 12 23 16 8 2 14 
9 4 8 2 4 1 0 


Table 1
Counts of the 3x3 tables of BMD and FI corresponding to 13 subnational areas in NHANES III
State Complete Table

4 yA 14 9 8 7 3 2 2 
6 Zo «AZ TEANTOG 2 51 ows 32 5 
12 33 18 Zt 22 4 4 15 5 
17 25 if 13 8 5 3 0 0 
25 y) i 12 6 5 9 2 ] 
26 18 11 18 6 5 9 2 1 
29 9 4 10 3 2, 4 3 l 
36 42 17 os 32 13 18 9 6 
39 8 6 14 2 5 4 3 0 
42 14 8 11 12 8 + 8 l 
44 12 9 6 8 5 0 5 1 
48 130 44 Ps | 11 13 9 6 
53 14 10 15 10 10 14 3 1 

Note: In the complete 3x3 table the first (second, third) set of three numbers is the first (second, third) row; the column (row) total refers to the 3x3 table with only column (row) totals; the total refers to the 3x3 table with only total.


4.2 Posterior inference of the cell probabilities


We discuss the performance of our computations for the
expansion model, and then we discuss posterior inference
about the cell probabilities. We use the posterior mean
(PM), posterior standard deviation (PSD) and 95% credible
interval for each of the parameters of interest. We also present
the numerical standard errors (NSE) to assess the
repeatability of our computations.

In Table 2 we present summaries of the posterior
distributions of $\mu_1$, $\mu_2$, $\tau_1$, $\tau_2$ and $\beta$, both before and after
the application of the SIR algorithm. These summaries are
very similar, indicating that the approximation
$\pi_a(\Omega, y_{(2)} \mid y_{(1)}, u, v, w)$ is not unreasonable. For example, the
95% credible intervals for $\beta$ before and after the application
of the SIR algorithm are respectively (1.081, 1.940) and
(1.086, 1.947), which is very good agreement. The estimates of $\tau_1$
and $\tau_2$ should show the largest discrepancies, but these are
also reasonably close [e.g., for $\tau_1$ the 95% credible interval
with the approximation is (28.282, 64.204) and with the SIR
algorithm it is (27.962, 64.425)]. In both cases the NSEs are
small, indicating that the computations are repeatable.

In Table 3 we have compared our expansion model
(Model 3) with two other models. Model 1, an ignorable
nonresponse model, and Model 2, a nonignorable
nonresponse model (no centering), are described in Appendix
B. For illustration we have selected three areas: a large area,
a medium-sized area and a small area. There are differences
among the three models. In general, the larger estimates
tend to be smaller under Model 2 than under Model 1 or
Model 3 (i.e., the estimates from Model 3 are
naturally closest to Model 1, and not Model 2). Model 2
produces the largest variability; as expected, Model 3 gives
slightly larger variability than Model 1. Because of space
restrictions we have not presented the NSEs, but we note
that they are all smaller than 0.005.


Table 2
NHANES data on 13 areas: Comparison of the approximate posterior density and the correct posterior density using the posterior means
(PM), posterior standard deviations (PSD), numerical standard errors (NSE) and 95% credible intervals of the hyperparameters

                          Approximation                                 Adjusted
               PM      PSD     NSE    95% Int              PM      PSD     NSE    95% Int
$\mu_{21}$    0.528   0.031   0.001  (0.463, 0.582)       0.525   0.031   0.008  (0.456, 0.578)
$\mu_{22}$    0.131   0.021   0.001  (0.096, 0.181)       0.133   0.021   0.002  (0.094, 0.179)
$\mu_{23}$    0.328   0.028   0.001  (0.274, 0.383)       0.328   0.028   0.005  (0.269, 0.383)
$\mu_{24}$    0.013   0.006   0.000  (0.004, 0.027)       0.014   0.006   0.000  (0.004, 0.029)
$\tau_2$     21.638   9.559   0.255  (8.347, 46.587)     20.078   8.632   0.303  (8.538, 38.625)
$\mu_{111}$   0.280   0.023   0.001  (0.234, 0.324)       0.27?   0.023   0.004  (0.228, 0.319)
$\mu_{112}$   0.133   0.016   0.000  (0.102, 0.165)       0.134   0.017   0.002  (0.101, 0.165)
$\mu_{113}$   0.200   0.019   0.000  (0.163, 0.238)       0.199   0.019   0.003  (0.162, 0.236)
$\mu_{121}$   0.105   0.015   0.000  (0.078, 0.135)       0.107   0.015   0.002  (0.079, 0.135)
$\mu_{122}$   0.065   0.011   0.000  (0.044, 0.088)       0.065   0.011   0.001  (0.044, 0.087)
$\mu_{123}$   0.072   0.012   0.000  (0.050, 0.096)       0.073   0.012   0.001  (0.049, 0.097)
$\mu_{131}$   0.061   0.011   0.000  (0.041, 0.083)       0.061   0.011   0.001  (0.040, 0.083)
$\mu_{132}$   0.037   0.008   0.000  (0.023, 0.054)       0.036   0.008   0.001  (0.022, 0.054)
$\mu_{133}$   0.048   0.009   0.000  (0.031, 0.068)       0.048   0.009   0.001  (0.031, 0.068)
$\tau_1$     45.960  10.094   0.153  (28.282, 64.204)    45.177  10.562   0.679  (27.962, 64.423)
$\beta$       1.472   0.218   0.004  (1.081, 1.940)       1.449   0.208   0.022  (1.086, 1.947)


4.3 Bayes factor for evidence of association


We have also considered the association between BMD
and FI. It appears doubtful whether such an association
might exist, but it is interesting to look at this issue; see
Nandram, Cox and Choi (2005) for a discussion on this
problem. We use the Bayes factor (Kass and Raftery 1995)
to measure the strength of the evidence of an association
relative to no association in the $r \times c$ categorical table. We
have done so for each of the thirteen areas and all areas
combined.

We have used two procedures, one without extensive
modeling and the other using our nonignorable nonresponse
(expansion) model. The simple method is to fill in the cell
counts using an ordinary raking procedure, and we assume
there is no error in doing so. This is a common sense
procedure that survey practitioners have used routinely. In
the second procedure using our nonignorable nonresponse
model, we have obtained 1,000 combined tables for each
area as described in Section 3 on computations. For each
area we have obtained the cell counts for all four tables, and
we summed them to get a single table of all counts.


Note: The hyperparameters are $\mu_1$, $\mu_2$, $\tau_1$, $\tau_2$ and $\beta$.


Table 3
Posterior means of the cell probabilities and 95% credible intervals (CI) for three areas (large, medium and small) by the three models

                    Model 1                          Model 2                          Model 3
Cell       PM     PSD    95% CI             PM     PSD    95% CI             PM     PSD    95% CI
a. Large
(1,1)    0.239  0.044  (0.157, 0.326)     0.196  0.046  (0.117, 0.295)     0.259  0.038  (0.189, 0.335)
(1,2)    0.140  0.035  (0.078, 0.213)     0.127  0.035  (0.068, 0.200)     0.132  0.029  (0.082, 0.197)
(1,3)    0.240  0.044  (0.159, 0.332)     0.198  0.047  (0.118, 0.301)     0.248  0.037  (0.175, 0.322)
(2,1)    0.092  0.032  (0.039, 0.162)     0.098  0.040  (0.037, 0.188)     0.077  0.022  (0.039, 0.126)
(2,2)    0.074  0.028  (0.029, 0.136)     0.077  0.030  (0.030, 0.144)     0.056  0.020  (0.024, 0.099)
(2,3)    [illegible]  0.036  (0.070, 0.210)  0.121  0.042  (0.056, 0.219)   0.110  0.028  (0.058, 0.168)
(3,1)    0.036  0.020  (0.008, 0.083)     0.069  0.039  (0.013, 0.153)     0.047  0.018  (0.018, 0.086)
(3,2)    0.023  0.015  (0.003, 0.061)     0.043  0.025  (0.007, 0.100)     0.032  0.014  (0.009, 0.063)
(3,3)    0.025  0.017  (0.003, 0.066)     0.071  0.040  (0.010, 0.154)     0.042  0.016  (0.016, 0.079)
b. Medium
(1,1)    0.233  0.034  (0.169, 0.302)     [illegible]  0.043  (0.141, 0.305)  0.254  0.032  (0.194, 0.318)
(1,2)    0.143  0.028  (0.093, 0.200)     0.127  0.032  (0.072, 0.196)     0.146  0.024  (0.102, 0.197)
(1,3)    0.190  0.031  (0.132, 0.254)     0.140  0.034  (0.084, 0.218)     0.208  0.027  (0.156, 0.259)
(2,1)    0.174  0.031  (0.118, 0.237)     0.160  0.042  (0.092, 0.249)     0.154  0.027  (0.106, 0.211)
(2,2)    0.043  0.018  (0.015, 0.083)     0.060  0.028  (0.017, 0.124)     0.032  0.012  (0.012, 0.059)
(2,3)    0.049  0.020  (0.017, 0.095)     0.065  0.031  (0.018, 0.136)     0.042  0.014  (0.020, 0.072)
(3,1)    0.112  0.025  (0.068, 0.167)     0.120  0.041  (0.059, 0.209)     0.092  0.020  (0.056, 0.134)
(3,2)    0.047  0.018  (0.018, 0.088)     0.059  0.026  (0.019, 0.118)     0.040  0.014  (0.018, 0.071)
(3,3)    0.010  0.009  (0.000, 0.033)     0.056  0.032  (0.006, 0.122)     0.032  0.012  (0.013, 0.059)
c. Small
(1,1)    0.196  0.052  (0.103, 0.305)     0.164  0.055  (0.077, 0.288)     0.253  0.043  (0.175, 0.334)
(1,2)    0.081  0.034  (0.028, 0.158)     0.081  0.032  (0.030, 0.155)     0.091  0.028  (0.043, 0.152)
(1,3)    0.213  0.052  (0.118, 0.323)     [illegible]  0.055  (0.087, 0.300)  0.220  0.043  (0.137, 0.306)
(2,1)    0.093  0.041  (0.028, 0.186)     0.111  0.055  (0.029, 0.234)     0.073  0.028  (0.030, 0.139)
(2,2)    0.056  0.029  (0.012, 0.126)     0.066  0.031  (0.018, 0.136)     0.045  0.020  (0.014, 0.094)
(2,3)    0.115  0.045  (0.042, 0.215)     0.118  0.053  (0.038, 0.240)     0.092  0.030  (0.041, 0.158)
(3,1)    [illegible]  0.048  (0.036, 0.222)  0.113  0.056  (0.031, 0.239)   0.081  0.030  (0.033, 0.148)
(3,2)    0.044  0.028  (0.006, 0.113)     0.065  0.035  (0.013, 0.144)     0.043  0.020  (0.012, 0.086)
(3,3)    0.087  0.042  (0.022, 0.184)     0.107  0.055  (0.023, 0.227)     0.103  0.034  (0.047, 0.181)

Note: See Appendix B for a description of Models 1 and 2.


The raking procedure to obtain the cell counts is
described as follows. Let $\hat{n}_{jk}$ denote the cell counts for the
four tables combined. Let $n^{(1)}_{jk}$ denote the cell counts for the
table with complete data, $n^{(2)}_{j,c+1}$ denote the table with row
totals, $n^{(3)}_{r+1,k}$ denote the table with column totals and
$n^{(4)}_{r+1,c+1}$ denote the table with total. The cell counts for the
four tables are estimated as

$$\hat{n}_{jk} = n^{(1)}_{jk} + n^{(2)}_{j,c+1} \frac{n^{(1)}_{jk}}{n^{(1)}_{j \cdot}} + n^{(3)}_{r+1,k} \frac{n^{(1)}_{jk}}{n^{(1)}_{\cdot k}} + n^{(4)}_{r+1,c+1} \frac{n^{(1)}_{jk}}{n^{(1)}_{\cdot \cdot}}, \quad j = 1, \ldots, r;\ k = 1, \ldots, c.$$

In either case we denote the sum of the cell counts for
each area by $n$. For the raking procedure we have a single
table for each area, and for the nonignorable nonresponse
model we have a sample of 1,000 tables for each area. We
also have a single combined table for all areas under the
raking procedure and 1,000 tables for all areas combined.
We obtain the Bayes factor for each table under a
multinomial-Dirichlet model. It is worth noting that our
method uses the expansion model so that the cell counts
borrow strength from other areas, unlike the raking
procedure.
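
The fill-in step can be sketched as a one-pass proportional allocation, under the assumption that each supplemental count is spread over the cells in proportion to the complete-data table. The function and argument names are hypothetical, and an iterative raking scheme would differ from this sketch.

```python
def fill_in_counts(complete, row_only, col_only, neither):
    """Combine the four tables of one area into a single r x c table.

    complete : r x c list of lists of fully classified counts
    row_only : length-r counts for which only the row category is known
    col_only : length-c counts for which only the column category is known
    neither  : count of cases with neither category recorded
    Requires every row and column total of `complete` to be positive.
    """
    r, c = len(complete), len(complete[0])
    row_tot = [sum(complete[j]) for j in range(r)]
    col_tot = [sum(complete[j][k] for j in range(r)) for k in range(c)]
    grand = sum(row_tot)
    filled = []
    for j in range(r):
        filled_row = []
        for k in range(c):
            n1 = complete[j][k]
            filled_row.append(
                n1
                + row_only[j] * n1 / row_tot[j]   # share of the row-only cases
                + col_only[k] * n1 / col_tot[k]   # share of the column-only cases
                + neither * n1 / grand            # share of the unclassified cases
            )
        filled.append(filled_row)
    return filled


# Small made-up example; the combined table keeps the overall total of 40.
combined = fill_in_counts([[2, 2], [4, 2]], [4, 6], [6, 4], 10)
```

The allocation preserves the overall sample size, but it treats the filled-in counts as if they were observed, which is the "no error" assumption noted earlier for the simple method.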


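Then, for each table we take

$$n \mid \pi \sim \text{Multinomial}(n, \pi) \text{ and } \pi \sim \text{Dirichlet}(1).$$

That is, we take a uniform prior for $\pi$ with $\pi_{jk} > 0$ and
$\sum_{j=1}^{r} \sum_{k=1}^{c} \pi_{jk} = 1$. Under the hypothesis of no association
we have $\pi_{jk} = \pi_{j \cdot} \pi_{\cdot k}$, where $\pi_{j \cdot} > 0$, $\sum_{j=1}^{r} \pi_{j \cdot} = 1$ and
$\pi_{\cdot k} > 0$, $\sum_{k=1}^{c} \pi_{\cdot k} = 1$. Thus, the hypothesis of association is
that the $\pi_{jk}$ are unrestricted (except that they are
nonnegative and sum to unity) whereas for the hypothesis of
no association $\pi_{jk} = \pi_{j \cdot} \pi_{\cdot k}$.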

The Bayes factor is the ratio of the marginal likelihood 
under association versus the marginal likelihood under no 
association. This measures the strength of evidence of 
association versus no association; see Kass and Raftery 
(1995). Let p,(m) denote the marginal likelihood under 
association and p,(m) denote the marginal likelihood 
under no association. Then, letting $n_{j \cdot} = \sum_{k=1}^{c} n_{jk}$ and
$n_{\cdot k} = \sum_{j=1}^{r} n_{jk}$, it is easy to show that

$$p_1(n) = p_0(n)\, \frac{\Gamma(r)\, \Gamma(c)\, \Gamma(n + rc)}{\Gamma(rc)\, \Gamma(n + r)\, \Gamma(n + c)}\, \frac{\prod_{j=1}^{r} n_{j \cdot}! \prod_{k=1}^{c} n_{\cdot k}!}{\prod_{j=1}^{r} \prod_{k=1}^{c} n_{jk}!},$$

where $p_0(n) = n!\, \Gamma(rc) / \Gamma(n + rc)$. Observe that $p_0(n)$ is
not a function of $\{n_{jk}\}$. Thus, as a measure of association it
is the deviation of $\prod_{j} n_{j \cdot}! \prod_{k} n_{\cdot k}!$ from $\prod_{j} \prod_{k} n_{jk}!$
that matters. However, we note that for the classical Pearson
statistic for independence it is the deviations of $n_{jk}$ from
$n_{j \cdot} n_{\cdot k} / n$ that matter. But note that this test cannot be applied
because many of the expected cell counts are smaller than 5
under the hypothesis of no association and multinomial
sampling.
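
The two marginal likelihoods entering the Bayes factor can be computed on the log scale. The sketch below assumes a uniform Dirichlet prior on the $rc$ unrestricted cell probabilities and, for the product form $\pi_{jk} = \pi_{j \cdot} \pi_{\cdot k}$, independent uniform Dirichlet priors on the row and column margins; the latter is an assumption made here, and the function names are hypothetical.

```python
import math


def log_marginal_unrestricted(table):
    """log of n! * Gamma(rc) / Gamma(n + rc): uniform Dirichlet on all rc cells."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    return math.lgamma(n + 1) + math.lgamma(r * c) - math.lgamma(n + r * c)


def log_marginal_product(table):
    """log marginal likelihood when pi_jk = pi_j. * pi_.k (uniform margin priors)."""
    r, c = len(table), len(table[0])
    row = [sum(table[j]) for j in range(r)]
    col = [sum(table[j][k] for j in range(r)) for k in range(c)]
    n = sum(row)
    # Multinomial coefficient n! / prod n_jk!
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for r_ in table for x in r_)
    # Dirichlet-multinomial integrals over the row and column margins
    log_rows = math.lgamma(r) + sum(math.lgamma(x + 1) for x in row) - math.lgamma(n + r)
    log_cols = math.lgamma(c) + sum(math.lgamma(x + 1) for x in col) - math.lgamma(n + c)
    return log_coef + log_rows + log_cols


table = [[25, 7, 13], [8, 5, 3], [4, 2, 1]]          # made-up counts for illustration
log_unrestricted = log_marginal_unrestricted(table)
log_product = log_marginal_product(table)
log_ratio = log_product - log_unrestricted           # log of the ratio of the two
```

The Bayes factor is the ratio of these two quantities; working with `math.lgamma` avoids overflow in the factorials when the tables are large, as they are for the combined areas.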

We present our results in Table 4 and in Figure 1,
corresponding to the data in Table 1 for the
cross-classification of BMD and FI. We have presented the
logarithms of the marginal likelihoods (base e) and the
Bayes factors; these are to be interpreted using the rule of
thumb of Kass and Raftery (1995).

In Figure 1 we can see that the box plots are all above
zero except the one for the third area, which provides no
evidence for association; perhaps there is no evidence for
association in area 42 (10 in the figure) as well. A summary of
these results is presented in Table 4. The Bayes factors
show association in all areas, except area 12, and they are
much larger under the nonignorable nonresponse model.
Area 6 and all areas combined are elevated (336.3 vs. 5.8
and 3,798.2 vs. 0.183).


Table 4
NHANES data on 13 areas: Comparison of the negative log marginal likelihoods and Bayes factors for association of BMD and FI from the
raking procedure and the expansion model by area


                      Raking                    Expansion

area  $-\ln\{p_0(n)\}$  $-\ln\{p_1(n)\}$  BF  $-\ln\{p_1(n)\}$  BF
4 26.19 23.07 22.855 23.5o.014 14.78 .169 
6 45.73 43.98 5.766 40.50.038 336.27) 1.465 
12 31.14 38.01 0.001 33.40.054 0.370.027 
17 29:43 27.03 8.134 27.00.026 10.270.191 
25 25.44 26.02 0.558 23.80.029 9.550.202 
26 26.89 23.18 40.562 23.9o.018 24.710370 
2g 23.21 20.87 10.301 21.30018 8.400.115 
36 34.99 36.09 0.330 33.10.064 21.139 .928 
39 peaeaoit 24.89 0.325 23.60,044 2.24068 
42 Poe 30.21 0.497 30.30.09 4.330255 
44 25.61 30.48 0.008 24.40 027 5.199.137 
48 38.83 35.34 32.650 39.10.60 2.15 0.081 
53 27 UL 24.82 9.865 24.20017 19.40.289 
All 53.43 32:18 0.183 46. 10.049 3,798.24 151 92 


Note: Area "all" refers to all areas combined; the notation $a_b$ means that the average is $a$ and the standard error is $b$ over the 1,000 iterates;
$\ln\{p_0(n)\}$ is the same for both procedures.


Figure 1 Box plots of log Bayes factors versus areas to measure evidence for association between BMD and FI (vertical axis: log Bayes factor; horizontal axis: area; evidence for association: box plots above 0)


5. Concluding remarks


The purpose of this paper has been to develop a 
methodology to analyze data from incomplete two-way 
categorical tables, each table corresponding to an area. We 
have done so by extending the Bayesian methodology of 
Nandram and Choi (2002a,b) for binary data to r x c 
categorical tables for small areas. We have constructed a 
new Bayesian nonignorable nonresponse model (i.e., 
expansion model) which is centered on the ignorable 
nonresponse model. We have used Markov chain Monte 
Carlo methods (specifically the griddy Metropolis-Hastings 
sampler) to fit the model. We have compared our model 
with an ignorable nonresponse model and a nonignorable 
nonresponse model. Finally, we have done an illustrative 
example on estimating the cell probabilities for the 3x3 
table of BMD and income over thirteen subnational areas. 

We have shown that there are differences among the 
three models. Using the data on BMD and FI, we have 
shown that our expansion model is a compromise between 
the ignorable nonresponse model and the nonignorable 
nonresponse model. Using the Bayes factor we have shown
that there are differences between the tests of association for
BMD and FI when the cell counts are estimated from our
model and when using a raking procedure. In fact, owing to
the borrowing of strength, we have seen that the evidence
for association under our model is much stronger than that
from the raking procedure.

There are three additional avenues that we can explore. 
First, we can construct a model to incorporate systematic 
departure from ignorability. This task will need more costly 
field work to get the much-needed information. Second, it is 
also interesting to relax the assumption that the margins of 
the categorical table are fixed; see, for example, Nandram 
(2009) who looked at a single large area. Third, there can be 
further improvement in calibration (i.e., incorporating 
information about margins). 
Appendix A
Joint posterior density of the expansion model 


First, integrating the joint posterior density over p we 
get that 


91 
h(Q,1,,9, |p rages) asl in 
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Note that O< 7, $1,234 0, =1 and O<,, <1. We 
simplify the computation for 7, in (A.3) in two steps. 


First, in (A.3) we make the transformation 


ix ey Oi Daye ip ly. 
jal k= 
The new _ variables din satisfy the relationships 


0< bi, a ep ie a Oi =1 and the 7, are restricted so 


that 0<T,<1/ 4, fOf) 2p ie i ee teas Cae 


0<T, < rc. With this transformation we have 


Lean 
: jel Ca 
1= S(T Fs 
jk 


Ti 
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where b, = min{{l/o,j}7=1,...5k=1...,c}. 

Second, letting W,={B7,/1—7,,} and absorbing the 
factor B® /T(rcB) in J,, with some additional algebra, 
we have 
Appendix B
Ignorable and nonignorable nonresponse models 


Set W,, =1 in the expansion model to form the 
ignorable nonresponse model. For i=1,..., 4, we then take 
iid 
fel py t5S8 Dirichlet(pee 


and independently 


itd 
P; |, % ~ Dirichlet(p1,7, ). 


Also, p(t,) = {1/(1+1,)°}, t, = 0, p, ~ Dirichlet (1), 
p(t,)= {1/(1+t,)}, t, 20 and 1, ~ Dirichlet (1). Here 
we have independence at all levels and the vectors 1 are of 
the appropriate dimension with every coordinate equal to 
one. Note that all the parameters of the ignorable model are
identifiable and estimable. 


Set Tig, = Tis Wi In the expansion model to form the 
nonignorable nonresponse model. In this case, for 
LAs 

iiditnd 2 
Ti, |b, 7, ~ Dirichlet(p,7, ) 
and independently 


iid 


P;\-;, 7, ~ Dirichlet(p1,1, ). 


In this model, the parameters 7,, are not identifiable and 
we take Tt, ~ Gamma (a,,B,), where a, and f, are to 
be specified. The model specification is then completed by 
assigning T,,M, and pw, the same distributional properties 
as in the previous paragraph. 

As in Nandram, Cox and Choi (2005), a, and B, are 
specified as follows. The ignorable nonresponse model is fit 
to obtain a sample from the posterior density of t,. Then 
QO) and fP, are obtained using the method of moments. 
Nandram, Cox and Choi (2005) found that inference about 
P, is not very sensitive to the choice of these parameters. 
Why one should incorporate the design weights when
adjusting for unit nonresponse using response homogeneity groups


Phillip S. Kott


Abstract 


When there is unit (whole-element) nonresponse in a survey sample drawn using probability-sampling principles, a common 
practice is to divide the sample into mutually exclusive groups in such a way that it is reasonable to assume that each 
sampled element in a group was equally likely to be a survey nonrespondent. In this way, unit response can be treated as an
additional phase of probability sampling with the inverse of the estimated probability of unit response within a group 
serving as an adjustment factor when computing the final weights for the group’s respondents. If the goal is to estimate the 
population mean of a survey variable that roughly behaves as if it were a random variable with a constant mean within each 
group regardless of the original design weights, then incorporating the design weights into the adjustment factors will 
usually be more efficient than not incorporating them. In fact, if the survey variable behaved exactly like such a random 
variable, then the estimated population mean computed with the design-weighted adjustment factors would be nearly 
unbiased in some sense (i.e., under the combination of the original probability-sampling mechanism and a prediction model) 
even when the sampled elements within a group are not equally likely to respond. 


Key Words: Double protection; Prediction model; Probability sampling; Response model; Sampling phase; Stratified Bernoulli sampling.


1. Introduction 


In the absence of nonresponse, it is possible to estimate 
the mean of a finite population from a survey sample 
without having to assume a statistical model which, no 
matter how reasonable, may not hold true. This is done by 
assigning each element of the population a positive proba- 
bility of sample selection and creating estimators around this 
random-selection mechanism. Unfortunately, surveys taken 
in the real world often suffer from nonresponse. 

Two different types of models can be used in the face of 
unit (whole-element) nonresponse. One is a prediction or 
outcome model in which the survey variable of interest is 
assumed to behave like a random variable with known 
characteristics but unknown parameters. The other is a 
response or selection model where the very act of an 
element’s responding to a survey is treated as an additional 
phase of random sample selection. 

Conventionally, survey statisticians prefer response mod- 
els for two reasons. In addition to the convenience of 
response modeling allowing them to treat unit response as 
an additional phase of random sampling, a survey is usually 
designed to collect information on a number of variables 
from the sampled elements. Prediction modeling requires
assuming a different model for each survey variable, any one
of which could fail. Response modeling, by contrast, requires
only the assumption of a single model. This is no longer true
when there is item (survey-variable-specific) nonresponse.
Consequently, prediction modeling is more common when
handling item nonresponse through imputation. That being
said, item nonresponse is beyond the scope of this note.

Under an assumed response model, the element proba- 
bilities of response are treated as unknown, which means 
that they have to be estimated from the sample. Typically, 
the response mechanism is assumed to be independent 
across elements and not to depend on whether the element is 
in the sample (each element has an a priori probability of 
response which becomes operational if it is selected for the 
sample). The simplest and most commonly used response
model separates the sample, and implicitly the entire
population, into mutually exclusive groups, called “response
homogeneity groups” by Särndal, Swensson and Wretman
(1992) (the term “weighting classes” is more common; see,
for example, Lohr (2009, pages 340-341)), and assumes that
each element in a group is equally likely to be a unit
respondent regardless of its probability of selection into the
original sample, $\pi_k$. Thus, the response mechanism produces a
stratified Bernoulli subsample with the groups serving as the
strata.

Conditioned on the respondent sample sizes in the groups,
a stratified Bernoulli subsample with unknown selection
(response) probabilities is converted into a stratified simple
random subsample with known selection probabilities:
$r_g/n_g$ for the elements in group $g$ when that group has $n_g$
sampled elements, $r_g$ of which respond.

Although the conditional probability of response in
group $g$ under the stratified Bernoulli response model is
$r_g/n_g$, we will see it is often better to multiply the design
weight, $d_k = 1/\pi_k$, for a responding element in the group
not by $n_g/r_g$, but by

$$f_g = \frac{\sum_{k \in S_g} d_k}{\sum_{k \in R_g} d_k}, \qquad (1)$$

where $S_g$ is the original sample and $R_g$ the respondent
subsample in group $g$. This adjustment factor can be
different from $n_g/r_g$ when the $d_k$ in group $g$ vary.
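
To make the difference between the two adjustment factors concrete, the following minimal sketch computes the design-weighted factor, $\sum_{k \in S_g} d_k / \sum_{k \in R_g} d_k$, and the unweighted factor, $n_g/r_g$, for a single group. The function name `adjustment_factors` and the toy weights are hypothetical and serve only as an illustration.

```python
from typing import Sequence, Tuple


def adjustment_factors(d: Sequence[float],
                       responded: Sequence[bool]) -> Tuple[float, float]:
    """Return (f_g, n_g / r_g) for one response homogeneity group.

    d          -- design weights d_k = 1 / pi_k of the sampled elements
    responded  -- True where the sampled element is a unit respondent
    """
    r_g = sum(1 for r in responded if r)
    if r_g == 0:
        raise ValueError("the group has no respondents")
    n_g = len(d)
    sum_d_sample = sum(d)                                    # over S_g
    sum_d_resp = sum(dk for dk, r in zip(d, responded) if r)  # over R_g
    return sum_d_sample / sum_d_resp, n_g / r_g


# Hypothetical group: only the two elements with the small weights respond.
d = [10.0, 10.0, 30.0, 30.0]
responded = [True, True, False, False]
f_g, unweighted = adjustment_factors(d, responded)
print(f_g, unweighted)  # 4.0 2.0
```

When all the weights in the group are equal, the two factors coincide; they separate only when the weights vary and the respondents carry a different average weight than the full sample in the group.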

Little and Vartivarian (2003) claim that using the $f_g$ is
what is usually done in practice. They argue, however, that
incorporating design weights into the adjustment factor in
this way can “add to the variance”.

In Section 2, we develop the notation for estimating the
population mean of a survey variable. Using the $n_g/r_g$
produces a double-expansion estimator, while using the $f_g$
produces a reweighted-expansion estimator. We can express
both using a formulation in Kim, Navarro and Fuller (2006).
From that expression, it is possible to see that if the survey
variable roughly behaves like a random variable with a
constant mean within each group regardless of the design
weights, then using the $f_g$ will often be more efficient than
using the $n_g/r_g$. In fact, if the survey variable behaved
exactly like such a random variable, then the estimated
population mean computed with the $f_g$ would be nearly
unbiased under the combination of the original sampling
design and this prediction model even when the response
model fails.

In Section 3, we show that empirical results in Little and 
Vartivarian (2003) are consistent with these arguments and 
offer some concluding remarks. 


2. The two estimators 


Suppose we want to estimate the population mean of a
survey variable $y_k$:

$$\bar y_U = \frac{\sum_{k \in U} y_k}{N} = \frac{\sum_{g=1}^{G} \sum_{k \in U_g} y_k}{\sum_{g=1}^{G} N_g} = \frac{\sum_{g=1}^{G} N_g \bar y_{U_g}}{\sum_{g=1}^{G} N_g},$$

where the population $U$ is divided into $G$ groups, $U_1, \ldots, U_G$,
each $U_g$ contains $N_g$ elements, and $N = N_1 + \cdots + N_G$.
In the absence of nonresponse, each $N_g$ is estimated
in an unbiased fashion under probability-sampling theory by
$\hat N_g = \sum_{k \in S_g} d_k$, and each $\bar y_{U_g}$ is estimated in a nearly (i.e.,
asymptotically) unbiased fashion by

$$\bar y_{S_g} = \frac{\sum_{k \in S_g} d_k y_k}{\sum_{k \in S_g} d_k} \qquad (2)$$

under mild conditions when $n_g$ is sufficiently large. We
assume both here.

For a formal statement of the conditions under which
each $\bar y_{S_g}$ is consistent under probability sampling theory
and therefore nearly unbiased, see Fuller (2009, page 115).
The interested reader is directed to Fuller whenever a result 
in this note depends on assumptions about the design and 
population as the sample size grows arbitrarily large. A 
more rigorous treatment of much of what is to be discussed 
here under the response model can be found in Kim, 
Navarro and Fuller (2006). 

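Let us label the full-sample estimator for $\bar y_U$ we have
been discussing $\bar y_S = \sum_{g=1}^{G} \hat N_g \bar y_{S_g} / \sum_{g=1}^{G} \hat N_g$. There are more direct
ways to render $\bar y_S$, but this version will better serve our
purposes.

If we adjust for nonresponse using the $f_g$ in equation
(1), we have the reweighted-expansion estimator:

$$\bar y_{re} = \frac{\sum_{g=1}^{G} f_g \sum_{k \in R_g} d_k y_k}{\sum_{g=1}^{G} f_g \sum_{k \in R_g} d_k}.$$

Technically, $\bar y_{re}$ is the ratio of two reweighted-expansion
estimators, but we use the simpler terminology here.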

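Employing the $n_g/r_g$ results in the double-expansion
estimator:

$$\bar y_{de} = \frac{\sum_{g=1}^{G} \frac{n_g}{r_g} \sum_{k \in R_g} d_k y_k}{\sum_{g=1}^{G} \frac{n_g}{r_g} \sum_{k \in R_g} d_k}.$$

For our purposes, this estimator can also be expressed as

$$\bar y_{de} = \frac{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k y_k}{\sum_{k \in R_g} d_k p_k}}{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k}{\sum_{k \in R_g} d_k p_k}},$$

where

$$p_k = \frac{\sum_{j \in S_g} d_j}{d_k n_g} \quad \text{for } k \in S_g,\ g = 1, \ldots, G \qquad (3)$$

(so that $\sum_{k \in S_g} d_k p_k = \sum_{k \in S_g} d_k = \hat N_g$).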


Both $\bar y_{re}$ and $\bar y_{de}$ can now be written in the form:

$$\bar y_q = \frac{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k y_k}{\sum_{k \in R_g} d_k q_k}}{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k}{\sum_{k \in R_g} d_k q_k}}. \qquad (4)$$

For the reweighted-expansion estimator, all $q_k = 1$, while
for the double-expansion estimator, $q_k = p_k$ as defined by
equation (3).
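
As a numerical check on the common form, the sketch below computes the reweighted-expansion and double-expansion estimators directly from their adjustment factors and then again from a single function in which $q_k = 1$ or $q_k = p_k$, with $p_k = \sum_{j \in S_g} d_j / (d_k n_g)$. The function names and the toy data are hypothetical, and the common form coded here is our reading of equation (4), so the block should be taken as an illustration under that assumption rather than as a definitive implementation.

```python
import math
from typing import List, Tuple

# One group is a list of (d_k, y_k, responded) triples; all numbers are made up.
Group = List[Tuple[float, float, bool]]


def reweighted_expansion(groups: List[Group]) -> float:
    """Respondent weights d_k are multiplied by f_g = sum_{S_g} d / sum_{R_g} d."""
    num = den = 0.0
    for g in groups:
        d_resp = sum(d for d, _, r in g if r)
        f_g = sum(d for d, _, _ in g) / d_resp
        num += f_g * sum(d * y for d, y, r in g if r)
        den += f_g * d_resp
    return num / den


def double_expansion(groups: List[Group]) -> float:
    """Respondent weights d_k are multiplied by n_g / r_g."""
    num = den = 0.0
    for g in groups:
        factor = len(g) / sum(1 for _, _, r in g if r)
        num += factor * sum(d * y for d, y, r in g if r)
        den += factor * sum(d for d, _, r in g if r)
    return num / den


def general_form(groups: List[Group], use_p: bool) -> float:
    """Common form with q_k = 1 (use_p=False) or q_k = p_k (use_p=True)."""
    num = den = 0.0
    for g in groups:
        n_hat = sum(d for d, _, _ in g)   # estimated group size, sum over S_g
        n_g = len(g)
        resp = [(d, y) for d, y, r in g if r]
        # p_k = n_hat / (d_k * n_g), so d_k * p_k is constant within the group
        sum_dq = sum(d * (n_hat / (d * n_g) if use_p else 1.0) for d, _ in resp)
        num += n_hat * sum(d * y for d, y in resp) / sum_dq
        den += n_hat * sum(d for d, _ in resp) / sum_dq
    return num / den


groups = [
    [(10.0, 1.0, True), (10.0, 0.0, True), (30.0, 1.0, False), (30.0, 1.0, True)],
    [(5.0, 0.0, False), (20.0, 1.0, True), (20.0, 0.0, True)],
]
print(math.isclose(general_form(groups, False), reweighted_expansion(groups)))  # True
print(math.isclose(general_form(groups, True), double_expansion(groups)))       # True
```

Because $d_k p_k$ is constant within a group, the sum $\sum_{k \in R_g} d_k p_k$ reduces to $r_g \hat N_g / n_g$, which is why the choice $q_k = p_k$ reproduces the factor $n_g/r_g$.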

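We will soon have use of the following for our two
estimators:

$$\bar y_q - \bar y_S = \frac{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k e_k}{\sum_{k \in R_g} d_k q_k}}{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k}{\sum_{k \in R_g} d_k q_k}} \approx \frac{\sum_{g=1}^{G} \hat N_g \dfrac{\sum_{k \in R_g} d_k e_k}{\sum_{k \in R_g} d_k q_k}}{\sum_{g=1}^{G} \hat N_g}, \qquad (5)$$

where $e_k = y_k - \bar y_S$. Equation (5) holds exactly when all
$q_k = 1$. When $q_k = p_k$, the near equality depends on the
$r_g$ being sufficiently large and other mild conditions.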

Now assume the following response model holds: Each
element $k$ in a group has an equal, positive probability of
response that does not vary with $\pi_k$ or with $y_k$. That is to
say, the response indicator $\rho_k$, which is 1 when $k$ responds
if sampled and is 0 otherwise, is a Bernoulli random
variable with a common mean in $U_g$ regardless of the
values of $\pi_k$ and $y_k$.

By treating unit response as a second phase of probability
sampling in this way, the added variance/mean-squared-error
due to nonresponse given the original sample and the
$r_g$ for both estimators can be expressed as


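$$A_q = E_\rho\left[(\bar y_q - \bar y_S)^2 \mid S, r_g\right] \approx \frac{\sum_{g=1}^{G} \hat N_g^2 \, \mathrm{Var}_\rho\left(\bar e_{qR_g} \mid S, r_g\right)}{\left(\sum_{g=1}^{G} \hat N_g\right)^2}, \qquad (6)$$

where $\bar e_{qR_g} = \sum_{k \in R_g} d_k e_k / \sum_{k \in R_g} d_k q_k$ and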
under mild conditions on the population and original
sampling design we assume to hold, including (again) that
the $r_g$ are sufficiently large. These conditions make both
estimators nearly unbiased under quasi-probability sampling
theory (probability theory augmented with a response
model) and render the distinction between large-sample
variance and mean squared error moot. Quasi-probability
sampling theory is also known as “quasi-design-based” and
“quasi-randomization-based” sampling theory.

Looking at equations (6) and (7), we see that at one
extreme $\bar y_{re}$ has an added variance due to nonresponse of
(approximately) zero when all the originally sampled $y_k$ in
a group are equal, while at the other $\bar y_{de}$ has an added
variance of zero when all the originally sampled $d_k e_k$ (or,
put another way, the $d_k[y_k - \bar y_S]$) in a group are equal.
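
The two extremes can be checked on toy data. The sketch below, with hypothetical function names and made-up numbers, builds one data set in which the survey values are equal within each group and another in which the products of the design weight and the deviation from the full-sample mean are equal within each group. In the first, the reweighted-expansion estimator reproduces the full-sample estimator whatever the respondent set; in the second, the double-expansion estimator does.

```python
import math
from typing import List, Tuple

Group = List[Tuple[float, float, bool]]  # (d_k, y_k, responded), made-up values


def full_sample(groups: List[Group]) -> float:
    """Design-weighted mean over the whole original sample."""
    return (sum(d * y for g in groups for d, y, _ in g)
            / sum(d for g in groups for d, _, _ in g))


def estimator(groups: List[Group], design_weighted: bool) -> float:
    """Reweighted expansion (design_weighted=True) or double expansion (False)."""
    num = den = 0.0
    for g in groups:
        d_resp = sum(d for d, _, r in g if r)
        r_g = sum(1 for _, _, r in g if r)
        factor = sum(d for d, _, _ in g) / d_resp if design_weighted else len(g) / r_g
        num += factor * sum(d * y for d, y, r in g if r)
        den += factor * d_resp
    return num / den


weights = [[10.0, 10.0, 30.0, 30.0], [5.0, 20.0, 20.0]]
resp = [[True, True, False, False], [False, True, True]]

# Extreme 1: y_k equal within each group (2.0 in group 1, 5.0 in group 2).
case1 = [[(d, mu, r) for d, r in zip(ds, rs)]
         for ds, rs, mu in zip(weights, resp, [2.0, 5.0])]

# Extreme 2: d_k * (y_k - m) equals c_g within each group, with
# sum_g n_g * c_g = 4 * 3.0 + 3 * (-4.0) = 0 so that the full-sample mean is m.
m = 3.0
case2 = [[(d, m + c / d, r) for d, r in zip(ds, rs)]
         for ds, rs, c in zip(weights, resp, [3.0, -4.0])]

for case in (case1, case2):
    print(f"{full_sample(case):.3f} {estimator(case, True):.3f} "
          f"{estimator(case, False):.3f}")
# 3.080 3.080 3.800
# 3.000 3.120 3.000
```

Each printed row gives the full-sample, reweighted-expansion and double-expansion estimates in that order, so an estimator with no added error due to nonresponse matches the first column.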

Heuristically, the reweighted-expansion estimator is
more efficient than the double-expansion estimator when
$\bar e_{1R_g}$ is a better predictor of $e_k$ for $k \in S_g$ than $p_k \bar e_{pR_g}$.
Thus, when the groups were constructed as advised in Little
and Vartivarian (2003) and earlier in Little (1986) so that
the $y_k$ in a group are homogeneous (as opposed to the
$d_k[y_k - \bar y_S]$ being homogeneous), then the reweighted-expansion
estimator computed with the $f_g$ will usually be
more efficient than the double-expansion estimator
computed with the $n_g/r_g$.

The heuristic observation can be formalized with an
alternative justification for using the reweighted-expansion
estimator. Suppose the following prediction model holds:
Each $y_k$ in $U_g$ is a random variable with common mean,
$\mu_g$, regardless of $\pi_k$ and $\rho_k$. Then $\bar y_{re}$ is nearly unbiased
under mild conditions with respect to the combination of the
original sampling mechanism (which treats the $d_k$ as
random, where $d_k = 0$ for $k \notin S$) and the prediction model
(which treats the $y_k$ as random). That is to say,
$E_p[E_m(\bar y_{re} - \bar y_U \mid S)] \approx 0$, since the double expectations of both
$\bar y_{re}$ and $\bar y_U$ are nearly $\sum_{g=1}^{G} N_g \mu_g / \sum_{g=1}^{G} N_g$. This combined
unbiasedness is exact when the design is such that $\sum_{k \in S_g} d_k = N_g$.
Stratified simple random sampling is an example of
such a design. Unstratified sampling with unequal
probabilities and many multistage designs are not.




It is not hard to see that $\bar y_{de}$ is also exactly unbiased with
respect to this double expectation (i.e., $E_p[E_m(\bar y_{de} - \bar y_U \mid S)] = 0$)
when all the $\mu_g$ are equal. In fact, the
prediction-model expectation of both $\bar y_{de}$ and $\bar y_{re}$ equals this
common mean, as does the prediction-model expectation of
an estimator without any adjustment for unit nonresponse,
that is, with the $f_g$ in $\bar y_{re}$ replaced by 1. The advantage of
$\bar y_{re}$ over $\bar y_{de}$ under the prediction model obtains only when
the $\mu_g$ vary, that is, when the survey variable has a different
prediction mean across the groups.

Notice that if either the response model or the prediction 
model holds, then the reweighted-expansion estimator is 
nearly unbiased in some sense (i.e., under the combination 
of the original design and the response model or under the 
original design and the prediction model). This property has 
been called “double protection” against nonresponse bias. 
See, for example, Bang and Robins (2005). 


3. Concluding remarks 


In this note, we discussed two distinct types of models.
We stressed a response model, which treats the response
indicators, $\rho_k$, as Bernoulli random variables within each
group but with unknown parameters. We also described a
prediction model, which treats the survey values, $y_k$, as
random variables with unknown means that could vary
across groups but not within them.

As part of the response model, we assumed that within a
group, the $\rho_k$ do not depend on the $y_k$. Analogously, as part
of the prediction model, we assumed that within a group, the
$y_k$ do not depend on the $\rho_k$. When both $\rho_k$ and $y_k$ are
treated as random variables, the former assumption, that
nonrespondents are missing at random, is equivalent to the
latter assumption, that the response mechanism is ignorable
(see, for example, Little and Rubin 1987). It should be
understood, however, that the $y_k$ need not be treated as random
variables under the response model and the $\rho_k$ need not be
treated as random variables under the prediction model. The
two concepts (missingness at random and ignorable
nonresponse) may be equivalent in some sense but they are not
identical.

The heart of Little and Vartivarian (2003) is a series of 
simulations featuring a binary survey variable, two potential 
response groups, and two original selection probabilities. 
Both the survey variable and response indicators are generated
under five models. The expected value of each is a
function of (1) the response group only, (2) the selection
probability only, (3) neither, or (4 and 5) one of two equal
combinations of response group and selection probability.
This produces 25 scenarios, of which 10 are of primary
interest to us. Those are the ones in which the survey
variable is a function either of only the response group or of
neither the response group nor the selection probability.

As our theory predicts when the survey variable is a 
function of neither the response group nor the selection 
probability, both the reweighted- and double-expansion 
estimators have empirical biases near zero (Table 5 in Little 
and Vartivarian) because both are nearly unbiased under the 
combination of the original sampling design and a valid 
prediction model: all population elements have the same 
mean. When the survey variable is a function of the re- 
sponse group and the response indicator is wholly or partly 
a function of the selection probability, only the reweighted- 
expansion estimator is nearly unbiased empirically since 
only it is unbiased under the combination of the original 
sampling design and a valid prediction model. As a result,
$\bar y_{re}$ also has less empirical root mean squared error and
significantly less average absolute error as an estimator for
$\bar y_U$ (Tables 4 and 6 in Little and Vartivarian, respectively;
the significance test treats the mean value across the
simulations of $|\bar y_{de} - \bar y_U| - |\bar y_{re} - \bar y_U|$ as asymptotically
normal).

When both the survey variable and response indicators 
are functions of the response group only, the reweighted- 
expansion estimator has slightly less empirical root mean 
squared error and average absolute error than the double- 
expansion estimator but the latter is not significant. 

It should not surprise us that the reduction in empirical
root mean squared error is modest. The contribution to the
variance from nonresponse under the response model
mechanism expressed in equations (6) and (7) is conditioned
on the original sample (technically, the contribution of
nonresponse to the total quasi-probability variance of $\bar y_q$ is
the expectation of $A_q$ in equation (6) under the original
sampling mechanism). In applications where the response
rates are relatively large (in the simulations they averaged
0.5), this contribution can be dominated by the
probability-sampling variance/mean squared error of the full-sample
estimator, $\bar y_S$.

Two warnings are in order. The respondent sample size
within each group must be sufficiently large for the
reweighted-expansion estimator to be nearly unbiased under
quasi-probability theory. For the double-expansion
estimator, each $r_g$ need only be positive. Moreover, that the
reweighted-expansion estimator is doubly protected against
nonresponse bias is only helpful when either the assumed
response or prediction model is correct. If both the
response probabilities and survey values vary with the
design weights, then the reweighted-expansion estimator
can be meaningfully biased. Despite the slant taken in this
note, that is the take-away message Little and Vartivarian
(2003) intended, and it cannot be disputed.
Survey Quality


Lars Lyberg¹


Abstract 


Survey quality is a multi-faceted concept that originates from two different development paths. One path is the total survey 
error paradigm that rests on four pillars providing principles that guide survey design, survey implementation, survey 
evaluation, and survey data analysis. We should design surveys so that the mean squared error of an estimate is minimized 
given budget and other constraints. It is important to take all known error sources into account, to monitor major error 
sources during implementation, to periodically evaluate major error sources and combinations of these sources after the 
survey is completed, and to study the effects of errors on the survey analysis. In this context survey quality can be measured 
by the mean squared error and controlled by observations made during implementation and improved by evaluation studies. 
The paradigm has both strengths and weaknesses. One strength is that research can be defined by error sources and one 
weakness is that most total survey error assessments are incomplete in the sense that it is not possible to include the effects 
of all the error sources. The second path is influenced by ideas from the quality management sciences. These sciences 
concern business excellence in providing products and services with a focus on customers and competition from other 
providers. These ideas have had a great influence on many statistical organizations. One effect is the acceptance among data 
providers that product quality cannot be achieved without a sufficient underlying process quality and process quality cannot 
be achieved without a good organizational quality. These levels can be controlled and evaluated by service level 
agreements, customer surveys, paradata analysis using statistical process control, and organizational assessment using 
business excellence models or other sets of criteria. All levels can be improved by conducting improvement projects chosen 
by means of priority functions. The ultimate goal of improvement projects is that the processes involved should gradually 
approach a state where they are error-free. Of course, this might be an unattainable goal, albeit one to strive for. It is not 
realistic to hope for continuous measurements of the total survey error using the mean squared error. Instead one can hope 
that continuous quality improvement using management science ideas and statistical methods can minimize biases and other 
survey process problems so that the variance becomes an approximation of the mean squared error. If that can be achieved 
we have made the two development paths approximately coincide. 


Key Words: Quality management; Total survey error; Quality framework; Mean squared error; Process variability; 
Statistical process control; Users of survey data. 
1. Introduction


This article has been prepared in recognition of Joe
Waksberg’s unique contributions and leadership in survey
methodology. My first encounter with Joe’s work was his
article on response errors in expenditure surveys written
with John Neter (Neter and Waksberg 1964). Among
other things that article introduced me to the cognitive
phenomenon called telescoping. Later in life I had the
opportunity to work with Joe on the first conference and
monograph on telephone survey methodology where we
were part of the editorial group (Groves, Biemer, Lyberg,
Massey, Nicholls and Waksberg 1988). We also collaborated
on the preparation of many of the Hansen Lectures
that were published in the Journal of Official Statistics
(JOS) during my term as its Chief Editor. Joe himself
delivered the sixth lecture, which was published in JOS
(Waksberg 1998). Joe was a fantastic leader and it is a
great honor for me to have been invited to write this
article on survey quality, a topic that occupied his mind a lot.

Many of my friends have conveyed their views or sent
me materials in preparation of this article. Especially
I want to thank Paul Biemer, Dan Kasprzyk, Fritz
Scheuren, Dennis Trewin, and Maria Bohata for helping
me.

Survey quality is a vague, albeit intuitive, concept with
many meanings. In this article I discuss some observations
related to the development and treatment of the
concept over the last 70 years. For some developments
it is possible to trace roots even farther
back. Most of my discussion, however, concerns current
issues in government statistical organizations. It is within
official statistics that most of my survey quality examples
take place.

The article is organized as follows: In Section 2 I
discuss the total survey error paradigm, including error
typologies, treatment of the errors, and survey design
taking all error sources into account. In Section 3 I discuss
quality management philosophies that have had a large
impact on survey organizations since the early 1990s.
This impact is manifested by methods and approaches like
recognition of the user or the client, a discussion of costs
and risks in survey research, and the need for organizations
to continuously improve. Section 4 provides examples
of quality initiatives in survey organizations. Section
5 deals with the difficulties in measuring quality, either
directly or indirectly via indicators. How these measures
should be communicated to the users or clients is also
covered. Section 6, finally, offers some thoughts about how
survey practices must change to better serve the needs of the
users. The last section contains references.


1. Lars Lyberg, Department of Statistics, Stockholm University, 10691 Stockholm, Sweden. E-mail: Lars.Lyberg@stat.su.se.


2. The total survey error paradigm 
2.1 Some history of survey sampling 


There are a number of papers describing the development 
of early survey sampling methodology. In that early devel- 
opment there is an implicit or explicit recognition of quality 
issues although they are hidden under labels such as errors 
and survey usefulness (Deming 1944). The historical over- 
views provided by, for instance, Kish (1995), Fienberg and 
Tanur (1996), and O’Muircheartaigh (1997) all emphasize 
the fact that the period up to 1950 is characterized by a full- 
bloom development of sampling theory. During the 1920s 
the International Statistical Institute agreed to promote ideas 
on representative sampling suggested by Kiaer (1897) and
Bowley (1913). In 1934 Neyman published his landmark 
paper on the representative method. Later Fisher’s (1935) 
randomization principle was used in agricultural sampling 
and Neyman (1938) developed cluster sampling, ratio esti- 
mation and two-phase sampling and introduced the concept 
of confidence interval. Neyman showed that the sampling 
error could actually be measured by calculating the variance 
of the estimator. Bill Cochran, Frank Yates, Ed Deming, 
Morris Hansen and many others further refined the concepts 
of sampling theory. Hansen led a research group at the U.S. 
Census Bureau where much of the applied work and new 
theory development was conducted in those days. One 
remarkable result of the Census Bureau efforts was the two- 
volume textbook on sampling theory and methods (Hansen, 
Hurwitz and Madow 1953). As a matter of fact the advances 
in sampling theory were so prominent at the time that 
Stephan (1948) found it worthwhile to write an article about 
the history of modern sampling methods. 

It was early recognized that there could be survey errors 
other than those attributed to sampling. There are writings 
on the effects of question wording such as Muscio (1917). 
Research on questionnaire design was quite extensive in the 
1940s. Problems with errors introduced by fieldworkers 
collecting agricultural data in India were addressed by 
Mahalanobis (1946), resulting in a method for estimating 
such errors. The method is called “interpenetration” and can
be used to estimate so-called correlated variances introduced
by interviewers, editors, coders and those who supervise
these groups. The most prominent error sources were
certainly known around 1950. Deming had listed error
sources (1944) that constitute the first published typology of
survey errors and Hansen and Hurwitz (1946) had discussed
subsampling among nonrespondents in an attempt to pro- 
vide unbiased estimates in a situation with an initial nonre- 
sponse. But the methodological emphasis, up to then, had 
been on developing sampling theory, which is quite under- 
standable. It was very important to be able to show that 
surveys could be conducted on a sampling basis and in a 
variety of settings. By 1950 it had been demonstrated quite 
successfully that this was indeed possible. So it was time to 
move on to other issues and refinements. 

In those early days the use of the word quality was 
confined to mainly quality control, sometimes as quality 
control of survey operations. It was common that the quality 
control was verification and/or estimation of error sizes for 
various operations. Statistics were known to be plagued by 
errors other than those stemming from sampling but the 
process quality issue of how to systematically reduce these 
errors and biases was still to be developed (Deming 1944; 
Hansen and Steinberg 1956). 

The user 60 years ago was a somewhat obscure player, 
although not at all ignored by prominent survey methodolo- 
gy developers. For instance, Deming (1950) claimed that 
until the purpose is stated, there is no right or wrong way of 
going about a survey. Some other statisticians made similar 
statements. But the user was really hiding behind terms, 
such as subject-matter problem, study purpose or the key 
functions of a statistical system. 

Even now survey and quality are vague concepts. As 
pointed out by Morganstein and Marker (1997) varying 
definitions of quality undermine improvement work so we 
should, at least, try to distinguish between different defini- 
tions to see what purposes they might serve. One of the 
most cited definitions is attributed to Joseph Juran, namely
quality being a direct function of “fitness for use”. It turns
out that Deming already in 1944 used the phrase “fitness for
purpose”, not to define quality, but rather to explain what
made a survey product work.

For a long time “good” quality was implicitly equivalent
to a small mean squared error (MSE), i.e., data should be
accurate and accuracy of an estimate can be measured by
MSE, which is the sum of the variance and the squared bias.
We have noticed that survey statistics should also be useful, 
later denoted “relevant”. Many of today’s quality dimen- 
sions were not really an issue at the time. The users, too, 
were accustomed to the fact that surveys took time to carry 
out; timeliness was surely on the agenda but not as explicitly 
as it is today. A census took years to process. The users 
were accustomed to a technology that could only deliver 
relatively simple forms of accessibility. Hence, it was natu- 
ral for users and producers to concentrate on making sure 
that the statistical problem coincided reasonably well with 
the subject-matter problem and that MSE was kept at a
decent level, where MSE often was, and still is, treated as
equivalent to just the variance, without a squared bias
term added.
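
The identity behind this paragraph, MSE equals variance plus squared bias, can be checked numerically. The following is a minimal sketch, assuming a hypothetical estimator whose errors are normally distributed around a biased expectation; the function name `simulate_mse` and all numbers are illustrative only.

```python
import random
import statistics

def simulate_mse(true_value, bias, sd, n_reps=20000, seed=1):
    """Simulate repeated survey estimates and split the MSE into its parts."""
    rng = random.Random(seed)
    # Each replicate is the true value plus a fixed bias plus random error.
    estimates = [true_value + bias + rng.gauss(0.0, sd) for _ in range(n_reps)]
    mean_est = statistics.fmean(estimates)
    variance = statistics.pvariance(estimates, mu=mean_est)
    squared_bias = (mean_est - true_value) ** 2
    mse = statistics.fmean((e - true_value) ** 2 for e in estimates)
    return mse, variance, squared_bias

# Hypothetical case: true proportion 0.07, bias 0.01, standard error 0.02.
mse, variance, squared_bias = simulate_mse(true_value=0.07, bias=0.01, sd=0.02)
print(f"MSE = {mse:.6f}, variance + squared bias = {variance + squared_bias:.6f}")
print(f"share of MSE ignored when only variance is reported: {squared_bias / mse:.1%}")
```

The last line shows what is lost when the variance alone is reported as if it were the MSE.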

Before proceeding any further, let us define “survey”. A 
survey is a statistical study designed to measure population
characteristics so that population parameters can be esti- 
mated. Two examples of parameters are the proportion 
unemployed at a given time in a population of individuals, 
and the total revenue of a business or industry sector during 
a given time period. A survey can be defined as a list of 
prerequisites (Dalenius 1985a). According to Dalenius a 
study can be classified as a survey if the following prereq- 
uisites are satisfied: 


1. The study concerns a set of objects comprising a 
population; 

2. The population under study has one or more mea- 
surable properties; 

3. The goal of the study is to describe the population by 
one or more parameters defined in terms of measur- 
able properties, which requires observing (a sample 
of) the population; 

4. To get observational access to the population a frame 
is needed; 

5. A sample of objects is selected from the frame in 
accordance with a sampling design that specifies a 
probability mechanism and a sample size (where n 
might equal N, the population size); 

6. Observations are made on the sample in accordance 
with a measurement process (i.e., a measurement 
method and a prescription as to its use); 

7. Based on the measurements, an estimation process is 
applied to compute estimates of the parameters when 
making inference from the sample to the population 
under study. 
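
Prerequisites 4 to 7 can be illustrated with a minimal Python sketch. It assumes a hypothetical frame of 1,000 persons, simple random sampling without replacement as the sampling design, and an error-free measurement process; the function name `srs_proportion` and the data are illustrative and not part of Dalenius's definition.

```python
import random

def srs_proportion(frame, measure, n, seed=0):
    """Estimate a population proportion from a simple random sample.

    frame   : list of units giving observational access to the population
    measure : function returning 1 or 0 for a unit (the measurement process)
    n       : sample size; n may equal N, which gives a census
    """
    N = len(frame)
    if not 2 <= n <= N:
        raise ValueError("n must satisfy 2 <= n <= N")
    rng = random.Random(seed)
    sample = rng.sample(frame, n)            # sampling design: SRS without replacement
    y = [measure(unit) for unit in sample]   # measurement on the sampled units
    p_hat = sum(y) / n                       # estimation: sample proportion
    # Estimated variance of p_hat, including the finite population correction.
    var_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
    return p_hat, var_hat

# Hypothetical frame in which every tenth person is unemployed (true proportion 0.10).
frame = [{"id": i, "unemployed": i % 10 == 0} for i in range(1000)]
is_unemployed = lambda person: 1 if person["unemployed"] else 0

p_sample, v_sample = srs_proportion(frame, is_unemployed, n=200)
p_census, v_census = srs_proportion(frame, is_unemployed, n=len(frame))
print("sample estimate:", p_sample, "standard error:", v_sample ** 0.5)
print("census value:", p_census, "variance:", v_census)
```

Because the sketch assumes a perfect frame, full response and error-free measurement, sampling variance is the only error it captures.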


This definition implicitly lists the specific error sources 
that are present in survey work. For each source there are a 
number of methods available that minimize the effects but 
also measure their sizes (Biemer and Lyberg 2003; Groves, 
Fowler, Couper, Lepkowski, Singer and Tourangeau 2009). 

Deviations from the definition reflect quality flaws. 
Moreover such deviations are common. In some designs 
selection probabilities are unknown or the variance esti- 
mator chosen might not be the most suitable one, given the 
sample design applied. Whether such flaws are problematic 
or not depends on the purpose. 


2.2 The components of the total survey error 
paradigm 


The total survey error paradigm is a theoretical frame- 
work for optimizing surveys by minimizing the accumulated 
size of all error sources, given budgetary constraints. In
practice this means that we want to minimize the mean
squared error for selected survey estimates, namely those
that are considered most important by the main stake-
holders. The mean squared error is the most common metric
in survey work; it consists of a sum of variances and
squared bias terms from each known error source. Groves
and Lyberg (2010) provide a summary of the status of the 
paradigm in the past and in today’s survey practice. 
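
As a sketch of the metric described above, the mean squared error of an estimator can be written as follows. The decomposition over K error sources is an illustrative simplification that assumes additive and mutually uncorrelated error components; the symbols are introduced here and are not taken from the cited works.

```latex
\mathrm{MSE}(\hat{\theta})
  = E\bigl[(\hat{\theta} - \theta)^2\bigr]
  = \mathrm{Var}(\hat{\theta}) + B(\hat{\theta})^2,
\qquad
B(\hat{\theta}) = E(\hat{\theta}) - \theta .

% Assumed additive, uncorrelated error sources k = 1, ..., K
% (sampling, nonresponse, measurement, processing, ...):
\mathrm{MSE}(\hat{\theta})
  \approx \sum_{k=1}^{K} \sigma_k^2 + \Bigl(\sum_{k=1}^{K} B_k\Bigr)^2 .
```

Under these assumptions the bias terms enter through their sum, so biases of opposite sign can partly cancel, and removing a single bias term does not necessarily reduce the total.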

The idea that surveys should be designed taking all error 
sources into account stems from the early giants in the field. 
Morris Hansen, Bill Hurwitz, Joe Waksberg, Leon Pritzker, 
Ed Deming and others at the U.S. Census Bureau, Leslie 
Kish at the University of Michigan, P.C. Mahalanobis at the 
Indian Statistical Institute, and Tore Dalenius, Stockholm 
University were among those who took the lead in survey 
research, emphasizing errors and optimal design. They 
worried about the inherent limitations associated with 
sampling theory since nonsampling errors could make the 
theory break down. They were very practical and thought a 
lot about balancing errors and the costs to deal with them. 
Some of them saw similarities between a factory assembly 
line (Deming and Geoffrey 1941) and the implementation of 
some of the survey processes and introduced control 
methods obtained from industrial applications. 

Dalenius (1967) realized that there was as yet no “survey 
design formula” that could provide an optimal solution to 
the design problem. The approach taken by Dalenius and 
also Hansen, Hurwitz and Pritzker (1967) was a strategy of 
minimizing all biases and going for a minimum-variance 
scheme so that the variance became an approximation of the 
MSE. This was supposed to happen through intense 
verification schemes for ongoing productions and quite 
extensive evaluation studies for future productions. In 1969 
Dalenius, inspired by Hansen, presented a paper on total 
survey design, where the word “total” reflected the thought 
about taking all error sources into account. Hansen, 
Hurwitz, Marks and Mauldin (1951), Hansen, Hurwitz and 
Bershad (1961), and Hansen, Hurwitz and Pritzker (1964) 
developed the U.S. Census Bureau Survey Model that 
reflected contributions from interviewers, coders, editors, 
and crewleaders and allowed the estimation of those 
contributions to the total survey error. These estimation 
schemes were elaborated on by Bailar and Dalenius (1969) 
and consisted of variations of replication and interpenetra- 
tion. Bias estimation was assumed to be handled by com- 
paring estimates obtained from the regular operations with 
those obtained from preferred procedures (that could not be 
used on a large scale due to financial, administrative or 
practical reasons). Today this kind of approach is called the
“gold standard”.
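
The idea of interpenetration can be sketched in a few lines of Python. This is a minimal illustration, assuming a balanced design in which each interviewer receives a random subsample of equal size, and using the standard one-way ANOVA estimator of the intraclass correlation; it is not the Census Bureau survey model itself, and the function name `interviewer_icc` and the data are hypothetical.

```python
import statistics

def interviewer_icc(assignments):
    """ANOVA estimate of the intra-interviewer correlation.

    assignments: one list of responses per interviewer, all of equal length,
    where respondents were randomly allocated to interviewers.
    """
    k = len(assignments)
    m = len(assignments[0])
    if k < 2 or m < 2 or any(len(a) != m for a in assignments):
        raise ValueError("need at least 2 interviewers with equal workloads of 2 or more")
    grand_mean = statistics.fmean(y for a in assignments for y in a)
    means = [statistics.fmean(a) for a in assignments]
    # Mean squares between and within interviewer assignments.
    ms_between = m * sum((mu - grand_mean) ** 2 for mu in means) / (k - 1)
    ms_within = sum((y - mu) ** 2
                    for a, mu in zip(assignments, means) for y in a) / (k * (m - 1))
    denom = ms_between + (m - 1) * ms_within
    if denom == 0:
        return 0.0  # no variation at all in the responses
    return (ms_between - ms_within) / denom

# Hypothetical responses (e.g., reported hours worked) for three interviewers.
workloads = [[38, 40, 41, 39], [44, 45, 43, 46], [37, 39, 38, 40]]
rho = interviewer_icc(workloads)
print("estimated intra-interviewer correlation:", round(rho, 3))
```

A value near zero suggests that interviewers contribute little to the total variance, while a clearly positive value indicates correlated interviewer effects.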

It was stated that good survey design called for 
reasonably effective control of the total error by careful
specifications of the survey procedures, including adequate
controls. Hansen, Deming and others did worry about 
control costs but although statistical process control and 
acceptance sampling had been implemented in a number of 
survey organizations, there was very little discussion about 
continuous process improvement. A lot of the quality work 
had to do with estimation of error rates, controlling error 
levels for individual operators and conducting large-scale 
evaluation studies that usually took a long time. Users were 
not directly involved in the design process but in the U.S. 
federal statistical system they had at least some influence on 
what should be collected and presented. Dalenius (1968) 
provides more than 200 references on users and user 
conferences associated with the products of the U.S. Federal 
statistical system. 

While total survey design was first advocated by Hansen, 
Dalenius and others, users were seldom directly involved in 
the final determination of survey requirements. Quite often 
an official, administrator or statistician acted as a subject- 
matter specialist. Several decades ago this was the way we 
thought about users. Their opinions counted but they were 
not really involved in design decisions. Lurking in the back 
of our heads was the thought that this might not be a perfect 
model and in the late 1970’s Statistics Sweden published an 
internal booklet called “What to do if a customer shows up
on our doorstep”. 

The basic design approach suggested by Hansen, 
Dalenius and others contained a number of steps including: 


1. Specification of an ideal survey goal.

2. Analysis of the survey situation regarding financial,
methodological and information resources.

3. Developing a small number of alternative designs.

4. Evaluating the alternatives by reference to associated
preliminary assessments of MSE and costs.

5. Choosing one of the alternatives or a modification of
one of them or deciding not to conduct a survey at all.

6. Developing the administrative design including
feasibility testing, a process signal system (currently called
paradata), a design document, and a Plan B.


Kish (1965) had slightly different views on design. He 
liked the neo-Bayesian applications in survey sampling and 
psychometrics advocated by colleagues at the University of 
Michigan (Ericson 1969; Edwards, Lindman and Savage 
1963). For instance, Kish liked the idea that judgment 
estimates of measurement biases might be combined with 
sampling variances to construct more realistic estimates of 
the total survey error. Regarding the optimization problem 
Kish thought that the multipurpose situation was econom- 
ically favorable for surveys but that it could be difficult to 
decide on what to base the design on. If one principal
statistic can be identified then that alone can decide the
design and if there are a small number of principal statistics 
a compromise design is possible but if statistics are too 
disparate a reasonable design might not exist. Kish also 
emphasized the need for design information obtained from 
pilot surveys and pretests to facilitate design decisions. Kish 
noted that survey design and measurement could vary 
greatly across environments while sampling did less so. 
That could be one reason that sampling can be easily placed 
among the traditional statistical theories and methods, while 
it is more difficult to place the survey process in one specific 
discipline (Frankel and King 1996 in their interview with 
Kish). 

Kish, like the other giants, emphasized the importance of 
small biases but appreciated the fact that the reduction of 
one bias term might increase the total error. Kish was keen 
on getting a reasonable balance between different error 
sources and how error structures varied under different 
design alternatives. Like Hansen and colleagues Kish 
thought that relevant information should be contempora- 
neously recorded during implementation (again we see the 
parallel to paradata). Hansen and colleagues were really 
concerned about excessive but inadequate controls. They 
realized that some controls might have to be relaxed due to 
limited improvements and that degree of improvement in 
terms of affecting the estimates should be checked out 
before any relaxation could take place. They also suggested 
that one might have to compromise relevance to get 
controllable measurements or abstain from the survey. Both 
Hansen and colleagues and Kish were vigorously in favor of 
ending the practice that sampling error is the only survey 
error measured. 

When we look at today’s situation we can conclude that 
we still do not have a design formula for surveys. There is 
no planning manual to speak of and the literature on design 
is consequently very small, as is the literature on cost 
(Groves 1989 is an exception). And no design formula is in 
sight. Since the advent of the U.S. Census Bureau survey 
model a number of variants have appeared on the scene, 
some of them quite complicated (Groves and Lyberg 2010). 
A common characteristic is the fact that they tend to be 
incomplete, i.e., they do not take all error sources into 
account. Most statistical attention is on variance compo- 
nents and especially on measurement error variance. There 
are a number of other weaknesses associated with the total 
survey error concept. Most notably a user perspective is 
missing and a vast majority of users are not in a position to 
question or even discuss accuracy. The complex error 
structures and interactions do not invite outside scrutiny and 
user contacts often tend to concern less technical issues such 
as timeliness, comparability and costs. Users are not really 
informed about real levels of accuracy and we know very
little about how users perceive information about errors and
how to act on that.

As pointed out by Biemer (2001), in his discussion of 
Platek and Sarndal (2001), there is a lack of routine mea- 
surements of MSE components in statistical organizations. 
There are good reasons for this state of affairs. Complexity 
has already been mentioned and to that we can add factors 
such as costs, the fact that it is almost impossible to publish 
such information at the time data are released, and that there 
is no measure of total error that would take all error sources 
into account, either because of a lack of proper methodology or
because some errors defy expression. Groves and Lyberg (2010)
list some other weaknesses associated with the total survey 
error paradigm. For instance, we need to know more about 
the interplay between variances and biases. It is possible that 
an increase in simple response variance goes hand in hand 
with a reduction in response bias, say, when we compare 
interview mode with self-administered alternatives.
Recently, West and Olson (2010) showed that interviewer
variance can occur not only from individual interviewers’ 
effect on the responses within their assignments but also be- 
cause individual interviewers successfully obtain coopera- 
tion from different groups of sample members. 

Despite all its limitations, the strengths of the total survey 
error framework are quite convincing. The framework 
provides a taxonomic decomposition of errors, it separates 
variance from bias and observation from nonobservation, 
and it defines the different steps in the survey process. It 
serves as a conceptual foundation of the field of survey 
methodology, where subfields are defined by their asso- 
ciated error structures. Finally, it identifies the gaps in the 
research literature since any typology will show that some 
process steps are more “popular” than others. Just compare 
the respective sizes of the literatures on data collection and 
data processing. 

It seems, however, as if the total survey error framework 
needs some expansion along lines some of which were 
identified half a century ago. We need some guidance on 
trade-offs between measuring error sizes and making 
processes more error-free. Spencer’s (1985) question is:
how much should we spend on measuring quality versus 
quality enhancement? We also need some guidance on how 
to integrate additional notions into the framework, so that it 
becomes a total survey quality framework rather than a total 
survey error framework (Biemer 2010). For instance, if 
“fitness for use” predominates as a conceptual base, how 
can we launch research that incorporates error variation 
associated with different uses? This aspect will be discussed 
in the next section.


3. Quality management philosophies
in survey organizations


During the late 1980’s and the early 1990’s some 
statistical organizations were under severe financial pressure 
and in some cases simultaneously criticized for not being 
sufficiently attentive to user needs. Governments in Sweden, 
Australia, New Zealand and Canada as well as the Clinton 
administration in the U.S. were all keen on improving 
efficiency and user influence within their respective 
statistical systems. It was natural for these organizations to
look for inspiration in management theories and methods
(Drucker 1985) and specifically in what was called quality
management (Juran and Gryna 1988). In that newer 
literature it was possible to study the role of the customer, 
leadership issues, the notion of continuous quality improve- 
ment, and various tools that could help the statistical organi- 
zation improve. Especially influential to survey practitioners 
was work by Deming (1986), since he emphasized the role 
of statistics in quality improvement. He vigorously pro- 
moted the idea that improvement work should be led by 
statisticians, since they are trained in distinguishing between
different kinds of process variation. He thought that there 
were too few statistical leaders advising top management in 
businesses and he wanted more proactive statisticians to 
become such leaders. He was especially keen on developing 
Shewhart’s ideas about control charts as a means to distin- 
guish between the different types of variation, namely 
common and special cause variation. Shewhart’s improve- 
ment cycle Plan-Do-Check-Act was also part of Deming’s 
thoughts on quality (Shewhart 1939). 

Management principles have, of course, existed since 
ancient times. Juran (1995) provides lots of examples of 
what was in place in, for instance, the Roman empire. 
Craftsmanship and a guild system were basic building 
blocks. There were methods for choosing raw materials and 
suppliers. Processes were inspected and improved. Workers 
were trained and motivated and customers got warranties. 
All these features are found also in today’s management 
systems. The more modern development includes quality 
frameworks or business excellence models such as Total 
Quality Management (TQM), International Organization for 
Standardization (ISO) standards, the Malcolm Baldrige 
quality award criteria, the European Foundation for Quality 
Management (EFQM), Six Sigma, Lean Six Sigma, and the 
Balanced Scorecard. These models are not totally different. 
They often share a common set of values and common 
criteria for excellence. Rather they represent a natural 
development that can be seen in all kinds of work.
Thus, there has been a gradual adoption of quality
management models and quality strategies in statistical 
organizations and a merging with concepts and ideas 
already used in statistical organizations. My personal 
timeline for this development is the following (readers are 
invited to come up with different sets of events and dates): 


1875       Taylor introduces what he called scientific
           management;

1900-1930  Taylor’s ideas are used in, for instance, Ford’s
           and Mercedes Benz’s assembly lines;

1920’s     Fisher starts developing theories and methods
           for experimental design;

1924       Shewhart develops the control chart;

1940       The U.S. War Department develops a guide for
           analyzing process data;

1944       Deming presents the first typology of survey
           errors;

1944       Dodge and Romig present theory and tables for
           acceptance sampling;

1946       Deming goes to Japan;

1950       Ishikawa suggests the fishbone diagram as a
           tool for identifying factors that have a profound
           effect on the process outcome;

1954       Juran goes to Japan;

1960       Many businesses embark on a zero defects
           program;

1960       The U.S. Census Bureau quality control
           programs are developed;

1961       The U.S. Census Bureau survey model is
           launched;

1965-1966  Kish and Slobodan Zarkovich start talking
           about data quality rather than survey errors;

1970’s     Many statistical organizations provide quality
           guidelines;

1975       The Total Quality Management (TQM)
           framework is launched;

1976       The first quality framework in a statistical
           organization containing more dimensions than
           relevance and accuracy;

1987-1989  Launching of the ISO 9000, Malcolm Baldrige
           Award, Six Sigma and EFQM models;

1990’s     Many statistical organizations start working
           with quality improvement and excellence
           models;

1997       The Monograph on Survey Measurement and
           Process Quality;

1998       Mick Couper introduces the concept “paradata”
           as a subset of process data;

2001       The Eurostat leadership group on quality
           organizes the first conference on Quality
           Management in Official Statistics;

2007       Business architecture ideas enter the survey
           world.


From the mid 1990’s and on quality management philo- 
sophies have had an enormous effect on many statistical 
organizations. The effect is not necessarily higher quality 
across the board (no one has checked that). But the philo- 
sophies have led to an awareness in most organizations of 
the importance of good contacts with users and clients, and 
an aspiration in many of them to become “the best” or 
“world class”. Quality is on the agenda. 


3.1 The concept of quality 


During the last decades it has become obvious that 
accuracy and relevance are necessary but not sufficient 
when assessing survey quality. Other dimensions are also 
important to the users. The development of survey quality 
frameworks has taken place mainly within official statistics 
and has been triggered by the rapid technology development 
and other developments in society. These advanced 
technologies have created opportunities and user demands 
regarding potential quality dimensions such as accessibility, 
timeliness, and coherence that simply were not emphasized 
before. Decision-making in society has become more 
complex and global resulting in demands for harmonized 
and comparable statistics. Thus, there is a need for quality 
frameworks that can accommodate all these demands. 
Several frameworks of quality have been developed and 
they each consist of a number of quality dimensions. 
Accuracy and relevance are just two of these dimensions. 

For instance, the framework developed by OECD (2011) 
has eight dimensions: relevance, accuracy, timeliness, 
credibility, accessibility, interpretability, coherence, and 
cost-efficiency (Table 1). Similar frameworks have been 
developed by Statistics Canada (Statistics Canada 2002; 
Brackstone 1999), and Statistics Sweden (Felme, Lyberg 
and Olsson 1976; Rosén and Elvers 1999). The Federal 
Statistical System of the U.S. has a strong tradition in 
emphasizing the accuracy component (U.S. Federal 
Committee on Statistical Methodology 2001) although it 
certainly appreciates other dimensions. Perhaps they are 
viewed as dimensions of a more nonstatistical nature that 
still need a share of the total survey budget. The Interna- 
tional Monetary Fund (IMF) has developed a framework 
that differs from those of OECD, Australian Bureau of 
Statistics, Statistics Sweden, and Statistics Canada. IMF’s 
framework consists of a set of prerequisites and five 
dimensions of quality: integrity, methodological soundness,
accuracy and reliability, serviceability, and accessibility (see
Weisman, Balyozov and Venter 2010). 


Table 1
OECD’s quality framework

Dimension         Description
Relevance         Statistics are relevant if users’ needs are met.
Accuracy          Closeness between the value finally retained
                  and the true, but unknown, population value.
Credibility       The degree of confidence that users place in
                  data products based on their image of the data
                  provider.
Timeliness        Time length between data availability and the
                  event or phenomenon data describe.
Accessibility     How readily data can be located and accessed
                  from within data holdings.
Interpretability  The ease with which the data user may
                  understand and properly use and analyze the
                  data.
Coherence         Reflects the degree to which data products are
                  logically connected and mutually consistent.
Cost-efficiency   A measure of the costs and provider burden
                  relative to the output.


Without sufficient accuracy, other dimensions are 
irrelevant but the opposite is also true. Very accurate data 
can be useless if they are released too late to affect 
important user decisions or if they are presented in ways that 
are difficult for the user to access or interpret. Furthermore, 
quality dimensions are often in conflict. Thus, providing a 
quality product is a balancing act where informed users should
be key players. Typical conflicts exist between timeliness 
and accuracy, since it takes time to get accurate data 
through, for instance, extensive nonresponse follow-up. 
Another conflict is the one between comparability and 
accuracy since application of new and more accurate 
methodology might disturb comparisons over time (Holt 
and Jones 1998). 

Thus, many organizations have adopted a multi-faceted 
quality concept consisting not only of accuracy but also 
other dimensions. We might talk about a quality vector 
whose components vary slightly between organizations both 
in number and in contents. There are a number of problems 
associated with the quality vector approach. 

First, the development has not been preceded by user 
contacts. Producers of statistics have believed that users are 
interested in a specific set of dimensions even though it is 
obvious that a vast majority of users think that error 
structures are too complicated to grasp and assume that the 
producer should be responsible for delivering the best 
possible accuracy. In cases where the user or client has 
specific accuracy requirements a more in-depth dialog can 
take place between the two. In the rare studies that have 
investigated user perceptions of information on quality it 
turns out that users are mostly interested in dimensions that 
are easily understood, such as timeliness and indicators that 
are seemingly straightforward, such as response rates. The
user wants the producing statistical organization to be
credible, which translates into being capable of producing 
data with small or at least known errors and delivering them 
in a timely, reliable, and accessible fashion. The thought that 
it would be possible to produce a total quality measure 
based on weighted assessments of the different dimensions 
is not realistic, although Mirotchie (1993) argues to the 
contrary. In that paper Mirotchie makes a case for a standard 
set of quality indicators and provides a hypothetical illus- 
tration of indexing data quality indicators and computing an 
actual index (in this illustration the indicators are precision, 
nonresponse, reliability, timeliness and residuals). Even if a 
composite indicator in the form of an index were a possible 
development, the user would like to know which indicators 
contributed most to an index value. From a user’s point of 
view the least favorable index value could still reflect a 
situation providing the highest quality. Rarely can a low 
accuracy be compensated by good ratings on other dimen- 
sions, not even in the case of election exit polls where 
timeliness is imperative. Accuracy is still necessary and 
there is wide agreement that all reputable organizations 
should meet accuracy standards (Scheuren 2001; Kalton 
2001; Brackstone 2001). Phipps and Fricker (2011) provide 
an overview of quality frameworks and literature on total 
survey error. Thus, we can agree that survey quality is a 
multi-faceted concept involving multiple features of a 
statistical product or service. 


3.2 The quality movement’s impact on statistical 
organizations 


Just extending the quality framework from one or two 
dimensions to several is not sufficient to create a quality 
environment. In the late 1980’s and early 1990’s many 
statistical organizations became interested in quality issues 
beyond traditional aspects of data quality. Issues concerning 
customer satisfaction, communicating with customers, com- 
petition, process variability, cost of poor quality, waste, 
business excellence models, core values, best practices, 
quality assurance, and continuous quality improvement 
were suddenly part of the everyday activities in many 
organizations. 

Successful organizations know that continuous improve- 
ment (Kaizen) is necessary to stay in business and they have 
developed measures that help them change. This is true also 
for producers of statistics. Changes that are supposed to 
improve the statistical product are triggered by user de- 
mands, competition from other producers and from produc- 
er values that emphasize continuous improvement as part of 
the general business environment. The measures that can 
help a statistical organization improve are basically identical 
to those of other businesses. They can be built on business 
excellence models such as the European Foundation for
Quality Management (EFQM) (1999). The core values of
the EFQM model include results orientation, customer 
focus, leadership and constancy of purpose, management by 
process measures and facts, personnel development and 
involvement, continuous learning, innovation and improve- 
ment, development of partnerships, and public responsibili- 
ty. This model has been adopted by the European Statistical 
System (ESS) as a tool for national statistical institutes in 
Europe for achieving organizational quality. The thought is 
that good product quality, according to the dimensions 
mentioned (or some other product quality definition) cannot 
be achieved without good underlying processes used by the 
organization. It can also be argued that good product quality 
is achieved most efficiently and reliably by good process 
quality. If we view quality as a three-level concept it can be 
visualized as shown in Table 2. 


3.2.1 Product quality 


The deliverables agreed upon are called the product. It 
can be one or several estimates, datasets, analyses, registers, 
standard processes or other survey materials such as frames 
and questionnaires. Product quality is the traditional quality 
concept used when informing users or clients about the 
quality of the product or service. It can be measured and 
controlled by means of degree of adherence to specifications 
and requirements for product characteristics adding up to 
quality dimensions of a framework. Measures of accuracy 
and margins of error belong here. Observations on
whether service level agreements established with the
client have been met are also relevant. In line with
quality management principles, it is also quite common to 
conduct user satisfaction surveys to find out what users 
think about the products and services that are provided. 


3.2.2 Process quality 


Table 2
Quality as a three-level concept*

Quality level Main stake-holders      Control instrument                        Measures and indicators
Product       User, client            Product specs, SLA, evaluation studies,   Frameworks, compliance, MSE, user surveys
                                      frameworks, standards
Process       Survey designer         SPC, charts, acceptance sampling, risk    Variation via control charts, other paradata
                                      analysis, CBM, SOP, paradata,             analyses, outcomes of periodic evaluation
                                      checklists, verification                  studies
Organization  Agency, owner, society  Excellence models, ISO, CoP, reviews,     Scores, strong and weak points, user surveys,
                                      audits, self-assessments                  staff surveys

* SLA (Service Level Agreement), SPC (Statistical Process Control), CBM (Current Best Methods), SOP (Standard Operating Procedures), and
CoP (ESS Code of Practice).

All processes have to be designed so that they deliver
what they are supposed to. This means that we have to have
some kind of quality assurance perspective when processes
are defined. For instance, the process of interviewing
implies that a number of elements must be in place for the
process to deliver what is expected. Examples of elements
are an effective selection of interviewers and a training
program, a compensation system as well as supervision and
feedback activities. Thus we aim at building quality into the
process via quality assurance. Quality control efforts are
only used to check if the process works as intended; they
cannot by themselves build quality into the process. In
Section 4.4 this process view is discussed in more detail.
Process quality is measured and controlled via selection,
observation and analyses of key process variables, so-called
process data or paradata (Morganstein and Marker 1997;
Couper 1998; Lyberg and Couper 2005). Theory and
methods imported from statistical process control can help
the producer distinguish between the two types of variation,
common and special cause. As long as all variation is
contained within the upper and lower control limits
associated with the control charts chosen, the process is said
to be in statistical control and no process improvements are
really possible by trying to adjust individual outcomes. If
there are observations falling outside of the control limits,
usually set at 3 sigma, then we have indications of special
cause variation that should be taken care of so that the
variation after adjustment is brought back to common cause
variation. The following P-chart illustrates a possible
situation:
Thus, the action sequence is the following. First the roots 
of the special causes are taken care of so that these 
variations are eliminated. After that the process displays 
common cause variation only. If that variation is deemed 
too large then the process has to change. The kinds of 
changes necessary are seldom obvious at the outset. Indeed 
perhaps several are necessary to decrease the process 
variation. Typically, a process improvement project is 
needed and the quality management literature has promoted 
a number of tools that are useful in such projects. Most of 
these tools are borrowed from statistics (control charts, 
experiments, regression analysis, Pareto diagrams, scatter 
plots, stratification) but there are also tools for identifying 
probable problem root causes (fishbone diagrams, process 
flow charts, brainstorming). The common thinking is that 
improvement projects should be “manned” by people 
working with the process or by people very much familiar 
with the process in other ways. Sometimes, we talk about 
forming an improvement team, where also the client or 
customer participates. In any improvement work suggested 
changes have to be tested. When Shewhart first developed 
his control charts he also suggested that improvement work 
should follow a sequence of operations, Plan-Do-Check- 
Act. What this sequence tells us is that any process changes 
suggested should be tested to see if they actually improve 
the process. If not, another change is made, and testing done 
again. Deming called this line of thinking the Shewhart 
cycle but since Deming spent a lot of time promoting it, 
many eventually called it the Deming cycle. The changes 
sought after could be decreased process variation, reduced 
costs, or increased customer satisfaction. The improvement 
project methodology is described in, for instance, Joiner 
(1994), Box and Friends (2006), Breyfogle (2003), and 
Deming (1986). 
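
The control-limit logic described above can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it uses the usual binomial approximation for a P-chart with 3-sigma limits, the function names are hypothetical, and the batch sizes and error counts are invented for the purpose of the example (ten coding batches, one of which has an unusually high error rate).

```python
import math


def p_chart_limits(error_counts, batch_sizes, sigmas=3.0):
    """Return the centre line and the (lower, upper) control limits per batch.

    The centre line is the pooled error proportion; the limits use the
    binomial standard error for each batch size (an assumption of this sketch).
    """
    p_bar = sum(error_counts) / sum(batch_sizes)
    limits = []
    for n in batch_sizes:
        half_width = sigmas * math.sqrt(p_bar * (1.0 - p_bar) / n)
        limits.append((max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)))
    return p_bar, limits


def special_cause_batches(error_counts, batch_sizes, sigmas=3.0):
    """Return the indices of batches whose error proportion is outside the limits."""
    _, limits = p_chart_limits(error_counts, batch_sizes, sigmas)
    flagged = []
    for i, (errors, n) in enumerate(zip(error_counts, batch_sizes)):
        lower, upper = limits[i]
        proportion = errors / n
        if proportion < lower or proportion > upper:
            flagged.append(i)
    return flagged


# Hypothetical data: ten coding batches of 200 records each.
batch_sizes = [200] * 10
error_counts = [10, 12, 9, 11, 8, 10, 30, 9, 12, 11]

centre_line, control_limits = p_chart_limits(error_counts, batch_sizes)
print("centre line:", round(centre_line, 3))
print("batches signalling special cause variation:",
      special_cause_batches(error_counts, batch_sizes))
```

In line with the action sequence above, a flagged batch would first be investigated for its root cause. Only when no batch is flagged does it make sense to ask whether the remaining common cause variation is too large. In practice the limits would normally be recomputed once special causes have been removed.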

Another way of checking the process quality is to use 
acceptance sampling. Acceptance sampling (Schilling and 
Neubauer 2009) can be applied in situations where process 
elements can be grouped in batches. The batches are 
controlled and based on the outcome of that control it is 
decided whether the batch should be approved or reworked. 
Acceptance sampling plans guarantee an average outgoing 
quality in terms of, say, error rate, but there is no direct 
quality improvement involved. It is a control instrument that 
is suitable for operations such as coding, editing and 
scanning and then only when these processes are not really 
in statistical control. The method has been heavily criticized 
by Deming (1986) and others but can be the only control 
means available in situations where staff turnover is high 
and there is no time to wait for stable processes. 
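
How an acceptance sampling plan bounds the outgoing error rate can be illustrated with a small calculation. The sketch below assumes a single sampling plan (sample n items from a batch and accept the batch if at most c errors are found), binomial sampling, and rectifying inspection in which rejected batches are completely reworked and errors found in the sample are corrected. The function names and the plan parameters are hypothetical and are not taken from the sources cited above.

```python
from math import comb


def acceptance_probability(n, c, p):
    """Probability of accepting a batch with true error rate p.

    The batch is accepted when the sample of size n contains at most c errors.
    """
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(c + 1))


def average_outgoing_quality(n, c, p, lot_size):
    """Average outgoing error rate under rectifying inspection.

    Accepted batches keep their errors outside the sample; rejected batches
    are assumed to be reworked until error free.
    """
    pa = acceptance_probability(n, c, p)
    return pa * p * (lot_size - n) / lot_size


# Hypothetical plan: batches of 1000 coded records, sample 50, accept if <= 2 errors.
for incoming in (0.01, 0.03, 0.05, 0.08, 0.12):
    aoq = average_outgoing_quality(50, 2, incoming, 1000)
    print(f"incoming error rate {incoming:.2f} -> average outgoing {aoq:.4f}")
```

The largest value of the average outgoing quality over all incoming error rates is the limit that such a plan guarantees. The calculation also shows why the method controls rather than improves: the incoming error rate itself is left unchanged.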

Global paradata (Scheuren 2001) are “error” rates of 
different kinds. Examples include nonresponse rates, coding 
error rates, scanning error rates, listing error rates, etc. In 
some operations the error rates are calculated using 
verification, which means that the operation is repeated in 
some way. That is the case for the coding operation. In other 
operations the calculation can be based on a classification 
scheme, which is the case for nonresponse rate calculation. 
These global paradata tell us something about the process. 
They are process statistics, i.e., summaries of data. A large 
nonresponse rate indicates problems with the data collection 
process and a high coding error rate indicates problems with 
the coding process. From these summaries it is sometimes 
possible to distinguish common and special cause variation 
and decide what action to take. 

Some standardized processes can be controlled by means 
of simple checklists. Checklists are very effective when it is 
crucial that every process step is made and in the right order 
(Morganstein and Marker 1997). This is the case when 
airline pilots prepare for take-off. No matter how many 
times they have taken off, without a checklist the day will 
come when they forget an item. In statistics production, 
sampling is such a process, albeit with less severe 
consequences if items are missed. It might very well be the 
case that a statistical organization has a standardized process 
for sample selection and a checklist that can be used as a 
combination of work instruction and control instrument. 

There is a kind of checklist that can be used in more 
creative processes such as the overall survey design process. 
It is not possible to standardize the survey design process 
but it is possible to list a number of critical steps that always 
must be addressed. The list does not tell us how to address 
them. It just serves as a reminder that an individual step 
should not be omitted or forgotten. Morganstein and Marker 
(1997) discuss this kind of checklist and call them (and the 
simpler checklists) Current Best Methods (CBM). They 
describe the CBM development process and how the CBMs 
can be used to decrease the process variation in statistical 
organizations. For instance, an organization might have 
seven different imputation methods and systems in its 
toolbox. It is costly to maintain these seven systems. It is 
unlikely that they are equally efficient. If they are, it may 
not be economically feasible to keep them all. In this 
situation a CBM that describes fewer options to the 
organization seems like a good idea. This could be 
accomplished by forming an improvement team consisting 
of the imputation experts and some clients. CBMs are 
supposed to be revised when new knowledge is obtained, 
which implies that there is an expiration date associated 
with every CBM. 

CBMs are of course “best practices” in some sense. 
Many organizations want best practices implemented and 
used. Morganstein and Marker offer a process for 
developing these best practices and keeping them current. It 
is beneficial for an organization if the variation in process 
design can be kept at a minimum. It then becomes easier to 
train people and change the process when it becomes 
unstable or when new methods are developed. On the other 
hand, if CBMs and other standards are not vigorously 
enforced within an organization, they will not be widely 
used and the investment will not pay off. 


3.2.3 Organizational quality 


Management is responsible for quality in its widest 
sense. It is the organization that provides leadership, 
competence development, tools for good customer relations, 
investments, and funding. The quality management field has 
given us business excellence models that can help us 
evaluate our statistical organizations in the same way other 
businesses are evaluated. The two main business excellence 
models are the Baldrige National Quality Program and the 
European Foundation for Quality Management (EFQM). 

These models consist of criteria to be checked when 
assessing an organization. The Malcolm Baldrige award 
uses seven main criteria: Leadership, strategic planning, 
customer and market focus, information and analysis, 
human resource focus, process management, and business 
results. Each criterion has a number of subcriteria. For 
instance, human resource focus consists of work systems, 
employee education, training and development, and 
employee well-being and satisfaction. The EFQM model 
has nine criteria: leadership; strategy; people; partnerships 
& resources; processes, products & services; customer 
results; people results; society results; and key results. These 
models can be used for self-assessment or external 
assessment. The organization provides a description of what 
is in place regarding each criterion and the organization is 
scored based on that description. Typically self-assessments 
result in higher scores than external ones. It is very difficult 
to get a high score from external evaluators since the models 
are very demanding. For each criterion the organization is 
asked if it has a good approach in place somewhere in the 
organization. This is often the case. The next question is 
how wide-spread this good approach is within the organiza- 
tion. Many organizations lose momentum here, since there 
is very little truth in the mantra “the good examples are 
automatically spread throughout an organization”. Instead 
good approaches usually have to be vigorously promoted 
before they are accepted within the organization. The third 
question asks whether the approach is periodically evaluated 
to check if it achieves the results expected. This is where 
most organizations fail. Their usual strategy is to exhaust an 
approach until the problems are so great that the approach 
has to be replaced rather than adjusted. This strategy is, of 
course, disruptive and expensive and does not score highly 
in excellence assessments. The maximum number of points 
that can be obtained using these models is 1000 and very 
rarely does a winner get more than 450-600 points, which is 
an indication that there is a lot of room for improvement 
even in world class organizations. 

Some statistical organizations have used business excel- 
lence models for assessment. The Czech Statistical Office 
was announced Czech National Quality Award Winner for 
2009 in the Public Sector category based on EFQM. The 
office got 464 points. Eurostat’s leadership group on quality 
recommended the European national statistical offices to use 
the EFQM as a model for their quality work and Finland 
and Sweden are among those that have done so. Since the 
leadership group released its report in 2001 (see Lyberg, 
Bergdahl, Blanc, Booleman, Grünewald, Haworth, Japec, 
Jones, Korner, Linden, Lundholm, Madaleno, Radermacher, 
Signore, Zilhao, Tzougas and van Brakel 2001) other frame- 
works and standards have been developed. The European 
Statistical System has launched its Code of Practice, which 
consists of a number of principles with associated indica- 
tors. Regarding some principles, however, the indicators are 
more like clarifications. The list of principles resembles 
other lists that have been developed by the UN and other 
organizations. 

External assessments are probably more reliable than 
internal ones. There are a number of reasons for that. One is 
that it is difficult to criticize your peers since you have to 
interact with them in the future or if your own product or 
service will be assessed by those peers in the future. 
Experiences from Statistics Sweden and Statistics Canada 
show that self-assessments are limited in their capability of 
identifying serious weaknesses (see Section 5.3). 


3.2.4 Some specific consequences for statistical 
organizations 


Most statistical organizations have adopted quality man- 
agement ideas to varying degrees and with varying success. 
As pointed out by Colledge and March (1993) it is possible 
to list a number of obstacles associated with such implemen- 
tation. For a government agency it can be difficult to 
motivate its staff through monetary incentives, since there 
are restrictions on how tax money can be spent. The variety 
of users and products makes the dialog between the service 
provider and the user complicated and, as mentioned, neither 
the users nor, for that matter, the providers are totally familiar 
with all the biases and other quality problems that are 
present in statistics production. The effects of errors on the 
uses can vary and are often unknown. To complicate 
matters further, unlike most other businesses, suppliers are 
not very enthusiastic. In other businesses suppliers get paid 
while statistical organizations must motivate theirs, the 
respondents, who are seldom even given a cash incentive. 
On the other hand statistical organizations have a great 
advantage when it comes to applying quality management 
principles. A statistical organization knows how to collect 
and analyse data that can guide improvement efforts. One of 
the cornerstones in quality management philosophies is that 
decisions should be based on data and businesses that do not 
have support from statisticians are often unaware of data 
quality problems, which can have consequences for their 
decision-making. By and large, though, a statistical orga- 
nization is not different from any other business and it is 
quite possible to apply quality management ideas to 
improve all aspects of work. 


4. Examples of quality initiatives 
in statistical organizations 


In this section we will provide some examples of 
initiatives that statistical organizations have engaged in as a 
result of a general interest in quality in society. 


4.1 The total survey error 


Perhaps the most important thing to notice is that 
research and development in survey design, implementa- 
tion, sampling and nonsampling errors, and the effect of 
errors on the data analysis continue to thrive. Data with 
small errors is the major goal for reputable organizations, 
which is indicated by the steady flow of textbooks on data 
collection, sampling, nonresponse, questionnaire design, 
measurement errors, and comparative studies. New text- 
books are in progress covering gaps such as business 
surveys, translation of survey materials, and paradata. There 
are journals such as the Journal of Official Statistics, Survey 
Methodology, and Survey Practice that are entirely devoted 
to topics related to statistics production in a wide sense. 
Numerous other journals such as the Public Opinion 
Quarterly, the Journal of the American Statistical 
Association, and the Journal of the Royal Statistical Society 
devote much space to survey methods. The Wiley series on 
Survey Methodology and its associated conferences (on 
panel surveys, telephone survey methods (twice), measure- 
ment errors, process quality, business surveys, testing and 
evaluating questionnaires, computer assisted survey infor- 
mation collection, nonresponse, and comparative surveys) 
have been very successful and that is the case also for the 
continuing workshops on nonresponse and total survey 
error. Thus, there is no shortage of ideas regarding specific 
error sources and their treatment. Admittedly there are areas 
that are understudied such as specification errors, data 
processing errors and the impact of errors on the data 
analysis but by and large there is a healthy interest in 
knowing more about survey errors. The challenge lies in 
communicating this knowledge to people working in 
statistical organizations and in developing design principles 
that can be used to improve statistics production. There is a 
noticeable gap between what is known through research and 
what is known and applied in the statistical organizations. 
Thus, staff capacity building seems to be a continuing need, 
especially since the common idea that good examples 
spread like ripples within and between organizations is a 
myth. If that indeed were the case quality would by now be 
fantastic everywhere. Since it is not, many organizations 
have developed extensive training programs (Lyberg 2002). 


4.2 Risk and risk management 


One element of quality management that has entered the 
survey world is risk and risk management. Eltinge (2011) 
even talks about Total Survey Risk as an alternative to the 
total survey error paradigm. The identification and 
management of risks is an important part of modern internal 
auditing (Moeller 2005) and is perhaps the only major 
element that is missing in quality management frameworks 
such as EFQM. An error source can be seen as more risky 
than another and should, therefore, be handled with more 
care and resources than another less risky. For instance, not 
having an effective system for statistical disclosure control 
is seen as a very risky situation. Unlawful data disclosure is 
very rare historically, but when it happens it could 
potentially destroy future data collection attempts. Certain 
design decisions can be seen as risky. For instance, if we 
choose a data collection method that does not fit the survey 
topic we might get estimates that are so far from the truth 
that the results are useless. An example might be to study 
sensitive behaviors using face to face or telephone inter- 
viewing instead of a self-administered mode. There are also 
technical risks that need to be identified and assessed. For 
instance, the U.S. National Agricultural Statistics Service 
(Gleaton 2011) like many others has plans for disaster 
recovery. Groves (2011) and Dillman (1996) both discuss 
how the production culture and the research culture within a 
statistical organization might view risks in different ways. 
Change in statistical organizations is generally slow and 
there are sometimes good reasons for that. Change might 
result in failures such as unsuccessful implementation, large 
costs and decreased comparability. So in some sense both 
producers and users have a tendency to be hesitant toward 
changes suggested by researchers and innovators and that 
might be one reason why change takes a long time. It is very 
common to have parallel measurements for some time to 
handle risks associated with implementing a new method or 
system. According to Groves (2011) the production culture 
and the users have had the final say about any changes, at 
least up until now. At the same time innovation is badly 
needed in many production systems and there are examples 
of stove-pipe organizations that do not have much time left 
(to remain unchanged) because the resources to maintain 
their systems are simply not there. So even though there 
is resistance against change, lack of resources and 
competition will make sure that statistical organizations 
become more process-oriented and efficient. Reducing 
the number of systems and applications and developing 
and using more standardization seem to be one road 
forward. 


4.3 The client/customer/user 


The advent of quality management ideas in statistical 
organizations has made the receivers of statistical products 
and services more visible. Commercial firms have always 
talked about the client or the customer while government 
organizations have tended to call them users. In any case the 
recognition of someone who is supposed to use the 
end products has not been obvious to some providers. 
Admittedly the user has been a speaking partner since the 
beginning of the survey industry. In the U.S., conferences 
for users were quite frequent already 50 years ago (Dalenius 
1968; Hansen and Voight 1967). During six months 
1965-66, for example, the U.S. Census Bureau organized 23 
user conferences across the country and there were also 
advisory groups. The advisory nature of contacts with users 
has prevailed in many countries. The user conference format 
still exists but user input is now complemented by other 
means such as public discussions and internet forums. 
Rarely have users been directly involved in the planning and 
design of surveys. Even when it comes to discussions about 
the quality of data, producers have acted as stand-in users. 
The quality frameworks are a good example. The quality 
dimensions were defined with minimal consultation with 
users. The literature on how users perceive information 
about quality is extremely limited (Groves and Lyberg 
2010). Also, we do not know if the information on quality 
that we provide is useful to them (Dalenius 1985b). In fact, 
an educated guess is that many times it is not. In many 
surveys the users are many and sometimes unknown and 
their information and analytical needs cannot be foreseen 
ahead of time. It is often possible to single out one or a few 
main users to communicate with, but many of the design 
and quality problems are so complicated that a vast majority 
of users expect the service provider to deliver a product with 
the smallest possible error. Hansen and Voight stated that 
accuracy should be sufficient to avoid interpretation 
problems. Today there seems to be consensus among many 
that what users are interested in are products and services 
that can be trusted, i.e., the service provider should be 
credible. It is impossible for most users to check levels of 
accuracy. Aspects that an average user can discuss are issues 
such as timeliness, accessibility and relevance. Detailed 
discussions about technical matters and design trade-off 
issues including accuracy and comparability are more 
difficult to have. 

During recent decades the user has indeed become more 
prominent. Some organizations develop service level 
agreements together with a main user or client, where 
requirements of the final product or service are listed and 
can be checked at the time of delivery. Many organizations 
conducting business surveys have created units that 
continuously communicate with the largest businesses, since 
their participation and provision of accurate information is 
absolutely essential for the estimation process (Willimack, 
Nichols and Sudman 2002). The large businesses are not 
users in the strict sense. They are important suppliers often 
with an interest in the survey results. Another common 
communication tool is the customer satisfaction survey. The 
value of such surveys is limited due to the acquiescence 
phenomenon and problems finding a knowledgeable re- 
spondent who is also willing to respond. Also, many 
customer satisfaction surveys are based on self-selection 
resulting in zero inferential value. In those surveys the 
results can only be viewed as lists of issues and concerns 
that some customers convey. Such information can, of 
course, be very valuable but is not suitable for estimation 
purposes. Many survey organizations now conduct user 
surveys on a continuing basis (Ecochard, Hahn and Junker 
2008). 


4.4 The process view 


Quality management has reemphasized the importance of 
having a process view in statistics production. Viewing the 
production process as a series of actions or steps towards 
achieving a particular end that satisfies a user leads to 
good product quality. Process quality is an assessment of 
how far each step meets defined requirements or specifi- 
cations. One way of controlling the process quality is to 
collect process data that can vary with each repetition of the 
process. The interesting process variables to monitor are 
those that have a large effect on the process’s end result. 
Thus to check a process for stability and variation we need 
mechanisms for identifying, collecting and analysing these 
key process variables. The quality management science has 
given us tools such as the Ishikawa fishbone diagram to 
identify candidates for key process variables. The statistical 
process control methodology has given us tools to distin- 
guish between special and common cause variation and how 
to handle these two variation types. Usually we use control 
charts originally developed by Shewhart (Deming 1986; 
Mudryk, Burgess and Xiao 1996) to make those distinc- 
tions. Then, again, we use methods from quality manage- 
ment to adjust the process if necessary. Examples include 
flowcharts, Pareto diagrams, and other simple means for the 
production team to identify the root causes of problems 
(Juran 1988). 
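
As an illustration of one of the simple tools mentioned above, the following sketch tabulates the counts behind a Pareto diagram: error causes are sorted by frequency and their cumulative share is computed, so that a production team can see which few causes account for most of the errors. The function name and the cause labels are hypothetical and serve only as an example.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative share) rows, most frequent cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, cumulative / total))
    return rows


# Hypothetical root causes recorded for ten verified coding errors.
example_causes = (
    ["unclear coding instruction"] * 6
    + ["ambiguous response"] * 3
    + ["keying slip"] * 1
)

for cause, count, share in pareto_table(example_causes):
    print(f"{cause:<28} {count:>3} {share:>6.0%}")
```

In a real application the causes would come from the team's own review of verified errors, and the categories themselves are often a result of a fishbone or brainstorming exercise.
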
Process data have been used to check on processes used 
in statistics production since the 1940’s, first within the U.S. 
Census Bureau and then at Statistics Canada and to some 
extent also in other agencies. Typical processes that were 
checked included coding, keying and printing and the 
process data were mainly error rates. Some of the process 
checks used at the U.S. Census Bureau were so complicated 
and expensive that their value was questioned (Lyberg 
1981), especially since the associated feedback loops were 
inefficient and not always aiming for the root causes of the 
errors. It was common that operators were blamed for 
system problems and at the time there was no emphasis on 
continuous quality improvement. The thinking at the time 
was more directed toward verification and correction. 

Morganstein and Marker (1997) developed a generic 
plan for continuous process improvement that can be used 
in statistics production. They had worked in many statistical 
organizations since the 1980’s and observed that quality 
thinking was not really developed in most of them. Their 
generic plan was built on their first-hand experiences and 
the general quality management ideas laid out by e.g., Juran 
(1988), Deming (1986), Box (1990), and Scholtes, Joiner 
and Streibel (1996). In essence the plan consists of seven 
steps: 


1. The critical product characteristics are identified 
together with the user, both broad and more single 
effort needs. 

2. A map of the process flow is developed by a team 
familiar with the process. The map should include the 
sequence of process steps, decision points and 
customers for each step. 

3. The key process variables are identified among a larger 
set of process variables. 

4. The measurement capability is evaluated. It is important 
that decisions are based on good data, not just data. 
Available data might be useless. This is an area where 
statistical organizations should have an advantage over 
other organizations. One should not reach conclusions 
about process stability without knowledge about 
measurement errors. Above all, data should allow 
quantification of improvement. 

5. The stability of the process is determined. The 
variability pattern of the process data is analyzed using 
control charts and other statistical tools. 

6. The system capability is determined. If stability is not 
achieved after special cause variation has been 
eliminated an improvement effort is called for. System 
changes must be made when the process variation is so 
large that it does not meet specifications, such as 
minimum error rates or production deadlines. Typical 
methods to reduce variation are the development and 
implementation of a new training program or the 
enforcement of a standard operating procedure. The 
latter can be a process standard, a current best methods 
standard or a simple checklist. 

7. The final step of the improvement plan is to establish a 
system for continuous monitoring of the process. We 
cannot expect processes to remain stable over time. For 
many reasons they usually start drifting after some time. 
A monitoring system helps keep track of new error 
structures, new customer requirements, and the 
potential of improved methods and technology and can 
suggest process improvements. 


The Morganstein and Marker book chapter had a distinct 
effect on quality work and process thinking in many 
European statistical organizations. Interest in these issues 
increased and some organizations started their own quality 
management system where process improvement was 
central. 

At the 1998 Joint Statistical Meetings Mick Couper 
presented an invited paper on measuring quality in a CASIC 
environment. He argued that the new technology generated 
lots of by-product data that could be used to improve the 
data collection process. He named those paradata, not in his 
paper but in his session presentation. This naming caught on 
very quickly in the survey community and it made sense to 
define the trilogy data, metadata, and paradata. Thus we had 
one term for data about the data (metadata) and another for 
data about the process (paradata). Obviously paradata are 
process data but for a long time paradata were confined to 
data about the data collection process, while the term used 
in many European statistical organizations was “process 
data” and took all survey processes into account (Aitken, 
Hörngren, Jones, Lewis and Zilhao 2004). Recently a 
renewed broadening of the meaning of the concept has 
taken place. Kennickell, Mulrow and Scheuren (2009) 
remind us about what they call macro paradata, global 
process data such as response rates, coverage rates, edit 
failure rates, and coding error rates that always have been 
indicators of process quality in statistical organizations. 
Lyberg and Couper (2005), Kreuter, Couper and Lyberg 
(2010), and Smith (2011) also use the more inclusive 
meaning of paradata where other processes than data 
collection are taken into account. There is a risk that 
paradata, like quality, becomes an overused concept. There 
are examples of discussions where all data, apart from the 
survey estimates, are considered paradata, which, of course, 
does not make sense. 

Paradata is a great name, and paradata are necessary to 
judge process quality. However, a word of caution is in 
order. One should never collect paradata that are not related 
to process quality and it is important to know how to 
analyze them. Sometimes statistical process control methods 
can be used but at other times other analytical techniques are 
needed. For instance, to be able to control interviewer 
falsification we might need to look at several processes 
simultaneously, but theory and methodology for such 
analysis might not be readily available. 

The expanded use of microdata that concern individual 
records, such as keystroke data and flagged imputed 
records, is an effect of using new technology. Modern data 
collection procedures generate enormous amounts of these 
kinds of paradata but so do systems for computer-assisted 
manual coding and systems for pure automated coding as 
well as systems for scanning of data. It makes no sense to 
confine the concept to data collection. 

Quality management has taught us to prevent process 
problems rather than fix them when they appear, that it is 
important to distinguish between different types of process 
variation since they require different actions, that any 
process intervention or improvement should be based on 
good data and proper analysis methods, and that even stable 
processes eventually start drifting, which calls for contin- 
uous monitoring. 


4.5 Standardization and similar tools 


One way of keeping process quality in control is to 
reduce variation by encouraging the use of standards and 
similar documents. Colledge and March (1997) discuss four 
classes of documents. 


A standard is a document that should be adhered to 
almost without exception. Deviations are not recom- 
mended and require approval of senior management. 
Corrective action should be taken when a standard is 
not fully met. An organization can become certified 
according to a standard. This is the case for ISO 
standards, where a few are relevant to statistical 
organizations. 

A policy should be applied without exceptions. For 
instance, an organization can have a policy regarding 
the use of incentives to boost response rates. 

Several organizations have developed guidelines for 
different aspects of the statistics production. Typically, 
guidelines can be skipped if there are “good” reasons to 
do so. 

A recommended practice is promoted but adherence is 
not mandatory. 


Admittedly, the categories of this classification scheme 
are not mutually exclusive, especially if we also take 
language and cultural aspects into account. For instance, in
Swedish, policies and guidelines are conceptually very close.
The unauthorized but consensus-based Wikipedia says that
“policies describe standards while guidelines outline best
practices for following these
guidelines”. This sentence contains three of the categories
mentioned by Colledge and March. It is probably best to 
relate to these different kinds of documents in a similar 
fashion. They all attempt to improve quality by reducing 
various types of variation and we should not dwell too much 
on what they are called. 

Although standards have been an important part of 
survey methodology for a long time they have gained 
momentum since statistical organizations became interested 
in quality management. Early standards such as Hansen
et al. (1967) and U.S. Bureau of the Census (1974)
concentrated on discussing the presentation of errors in data. 
At the U.S. Census Bureau, all publications were to inform
users that data are subject to error, that analysis can be
affected by those errors, and that estimated sampling errors
are smaller than the total errors. For major surveys, the
nonsampling errors were to be treated in more detail than
in the past. Many other statistical organizations imported
this line of thinking. For instance, the quality frameworks 
mentioned earlier are expansions including also other 
quality dimensions than accuracy. The European Statistical 
System has successively developed and launched what was 
first called Model Quality Reports and currently just 
Standard for Quality Reports (Eurostat 2009a). The standard 
provides recommendations to European National Institutes 
(notice the conceptual complexity) for preparation of quality 
reports for a “full” range of statistical processes and their 
outputs. The standard treats the basic quality dimensions 
relevance, accuracy, timeliness, accessibility, coherence and 
comparability. 

Let us look at some examples. Regarding measurement 
error, which is part of the accuracy component, the standard 
says that the following information should be included in a 
quality report: 

Identification and general assessment of the main risks 
in terms of measurement error. 

If available, assessments based on comparisons with 
external data, reinterviews or experiments. 

Information on failure rates during data editing. 

The efforts made in questionnaire design and testing, 
information on interviewer training and other work on 
error reduction. 

Questionnaires used should be annexed in some form. 


Regarding timeliness the standard says that the following 
information should be included: 


For annual or more frequent releases: the average 
production time for each release of data. 

For annual and more frequent releases: the percentage 
of releases delivered on time, based on scheduled 
release dates. 

The reasons for nonpunctual releases explained. 

There are also sections on how to communicate information
regarding trade-offs between quality dimensions,
assessment of user needs and perceptions, performance and
cost, respondent burden as well as confidentiality, transparency
and security. Even though there is a section on user
needs and perceptions, users have obviously not been
involved in the preparation of the standard itself. We still
know very little about how users perceive and use information
about quality. The standard is backed by a much
more detailed handbook for quality reports (Eurostat 2009b)
and both documents are built around the 15 principles listed
in the European Statistics Code of Practice, which is the
basic quality framework for the European Statistical System.
The Code of Practice principles concern professional
independence, mandate for data collection, adequacy of
resources, quality commitment, statistical confidentiality,
impartiality and objectivity, sound methodology, appropriate
statistical procedures, nonexcessive burden on
respondents, cost-effectiveness, relevance, accuracy and reliability,
timeliness and punctuality, coherence and comparability,
and, finally, accessibility and clarity. Each principle
is accompanied by a set of indicators that the individual 
organization can measure to establish whether it meets the 
Code or not. Some indicators are vague and very subjective 
in nature such as “the scope, detail and cost of statistics are 
commensurate with needs”, while others are more specific, 
such as “a standard daily time for the release of statistics is 
made public”. Peer reviews of compliance with a limited set of
the principles have been conducted using an earlier version
of the Code and, not surprisingly, many national statistical
offices in Europe have problems living up to the Code
(Eurostat 2011a). Therefore, to assist the
implementation of the Code, a supporting framework has
been developed, called the Quality Assurance Framework
(QAF), which contains more specific guidance regarding
methods and references (Eurostat 2011b). This seems to be
a very useful document since its references are mainly 
summaries of the state-of-the-art in areas such as sampling, 
questionnaire design, editing and so on, which stimulates 
conformity to current best practices. 

The Code of Practice has many similarities with the UN 
Fundamental Principles of Official Statistics (de Vries 
1999). The latter also promotes the principle of international
cooperation and coordination, which is, to a large extent, an
element that is missing in today’s development of statistics
production (Kotz 2005). Even neighbouring countries can
have very different approaches and methodological competence
levels and the differences are sometimes difficult to
explain. Experience shows that global development collaboration
is difficult to achieve. We meet, we talk, and we bring
back ideas that might fit our own systems. It is harder to
agree on common approaches. One global standard that
relates to statistics production is the ISO 20252 on market,
opinion and social research (International Standards 
Organization 2006). This is a process standard with around 
500 requirements concerning the research activities within 
an organization. It is a minimum standard for what to do 
rather than how to do things. It is suitable for organizations 
that conduct surveys and the organization can apply for 
certification. In April 2010 more than 300 organizations 
world-wide had been certified, most of them marketing 
firms. One national statistical office (Uruguay) was certified 
in 2009 and Statistics Sweden is planning a certification in 
2013 but those are the only national offices that have chosen 
this path. The standard concerns the organization’s system 
for quality management, management of the executive 
elements of the research, data collection, data management 
and processing, and reporting on research projects (Blyth 
2012). 

The standards of the U.S. Federal Statistical System 
concentrate on the accuracy component. Although not 
formally a standard the U.S. Federal Committee on 
Statistical Methodology (2001) suggests various methods 
for measuring and reporting sources of error in surveys. In 
2002 the U.S. Office of Management and Budget (OMB) 
issued information quality guidelines (OMB 2002) whose 
purpose was to ensure and maximize the quality, objectivity, 
utility, and integrity of information disseminated by federal 
agencies. OMB (2006a) has also issued standards and 
guidelines for surveys. They are built in a standard fashion. 
First comes a standard such as “Response rates must be 
computed using standard formulas to measure the 
proportion of the eligible sample that is represented by the 
responding units in each study, as an indicator of potential 
nonresponse bias”. This standard is then followed by a 
number of guidelines on how to make the necessary 
calculations while the final guideline states that “If the 
overall nonresponse rate exceeds 20%, an analysis of the 
nonresponse bias should be conducted to see whether data 
are missing completely at random”. As in the case of the 
ESS standards, the OMB guidelines are complemented by a 
supporting document (OMB 2006b) that can facilitate 
adherence to the standards. 
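
The structure of such a standard, a response rate followed by a threshold that triggers further analysis, can be illustrated with a small sketch. It is a simplified illustration only: the unweighted rate below is an assumption made for brevity and is not one of the formulas prescribed in the OMB guidelines, which also deal with weighting and cases of unknown eligibility. The function names are hypothetical; the 20% threshold is taken from the guideline quoted above.

```python
def unit_response_rate(respondents: int, eligible: int) -> float:
    """Unweighted unit response rate: responding units / eligible sample units.

    Simplifying assumption: eligibility is known for every sampled unit.
    """
    if eligible <= 0:
        raise ValueError("eligible must be a positive count")
    if not 0 <= respondents <= eligible:
        raise ValueError("respondents must lie between 0 and eligible")
    return respondents / eligible


def needs_nonresponse_bias_analysis(
    respondents: int, eligible: int, threshold: float = 0.20
) -> bool:
    """True when the overall nonresponse rate exceeds the threshold (20% by default)."""
    nonresponse_rate = 1.0 - unit_response_rate(respondents, eligible)
    return nonresponse_rate > threshold
```

With 750 respondents among 1,000 eligible units, the nonresponse rate is 25% and the check asks for a nonresponse bias analysis; with 850 respondents it does not.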

Most agencies in the decentralized U.S. Federal 
Statistical System have documents in place that adapt the 
OMB guidelines. For instance, the U.S. Census Bureau has
its own statistical quality standards that go into more
technical detail compared to the OMB documents. Each
standard is described via requirements and sub-requirements 
and they often provide very specific examples of studies that 
can be conducted. Examples of other U.S. agencies that 
have standards related to the quality of information 
disseminated include the National Center for Health 
Statistics, National Center for Education Statistics, and the 
Energy Information Administration. All these standards can
be downloaded from the agencies’ websites. 

Statistics Canada has issued quality guidelines since 
1985. They are similar to the ESS guidelines since not just 
accuracy is emphasized. But they are much more detailed 
and contain lots of references. A special feature is that for 
some processes the guidelines prescribe the use of statistical 
process control. No other agency seems to be doing that. 
The latest edition of the guidelines is provided in Statistics 
Canada (2009). 

Many other statistical organizations in the world have 
their own quality standards. They are sometimes described
as guidelines or standards and sometimes as business
support systems or quality assurance frameworks. In any
case, the contents and style vary across organizations but the 
variation should be manageable. It should be possible to 
achieve higher degrees of standardization globally, since 
that has happened in other fields, such as air travel. Apted, 
Carruthers, Lee, Oehm and Yu (2011) discuss various ways 
to industrialize the statistical production process at the 
Australian Bureau of Statistics. 

The question is whether international standards would 
benefit survey quality in general. Some areas where 
standards would be beneficial include computation of 
frequently used quality indicators such as error rates and 
design effects, as well as best practices for translation of 
survey materials, handling non-native language respondents, 
and weighting for nonresponse. One must bear in mind that 
once a standard is issued it has to be continually updated 
and it is well-known that they can be difficult to enforce. If 
they are comprehensive, standards can overwhelm the 
practitioner and, as a result, unless mandated and audited, 
they are largely ignored. 


4.6 Statistical business process models 


During recent years concepts like business process 
models and business architecture have become part of 
quality work in some statistical organizations. To make 
production processes more efficient and flexible they can be 
seen as part of a business architecture model (Reedman and 
Julien 2010). In statistics production a generic statistical 
process model is jointly developed by UNECE, Eurostat, 
and OECD. Any system redesign should be driven by 
customer demands, risk assessments and new developments. 
The architectural principles behind this thinking are 
summarized in Doherty (2010), which discusses architecture 
renewal at Statistics Canada. 


Some of the principles are: 


Decision-making should be corporately optimal, which 
entails centralization of informatics, methodology 
support and processing. 

Use of corporate services such as collection, data
capture and dissemination should be optimized.

Reuse should be maximized by having the smallest 
possible number of distinct business processes and the 
smallest possible number of computer systems. 

The corporate toolkit should be minimized. 

There should be staff proficiency in tools and systems.

Rework such as repeated editing should be eliminated.

The focus should be on the core business and the work
with support processes should be outsourced.

Development should be separated from the on-going
operations.

Electronic data collection should be viewed as the 
initial mode. 

Structural obstacles, such as overlapping or unclear 
mandates should be removed. 


These principles are very similar to those we identify
when we apply quality management principles from the
various frameworks and excellence models described previously.
The principles represent a move from decentralization
to more corporate level thinking. Many statistical
organizations realize that stove-pipe thinking is a thing of
the past and that a move to more centralization is necessary.


5. Measuring quality 


Thus, quality is a multi-faceted concept and measuring it 
is a complicated task. We have noted that survey quality can 
be viewed as a three-dimensional concept associated with 
the final product, the underlying processes that lead to the 
product, and the organization that provides the means to 
carry out the processes and deliver the product or service in 
a successful way. There are basically two ways to measure 
quality. One is to directly estimate the total survey error or 
some components thereof. The other is to measure 
indicators of quality with the hope that they indeed reflect 
the concept itself. 


5.1 Direct estimates of the total survey error 


The existing decompositions of the mean squared error
described in, for instance, Hansen et al. (1964), Fellegi
(1964), Anderson, Kasper and Frankel (1979), Biemer and 
Lyberg (2003), Weisberg (2005), and Groves et al. (2009) 
are all incomplete in the sense that they do not reflect all 
error sources. It is seldom possible to compute the MSE 
directly in practical survey situations because this usually 
requires a parameter estimate that is essentially error free. 
However, it is possible to obtain a second best estimate of 
the true parameter value if there are resources available to 
collect data using some “gold standard” methodology that is 
not affordable or practical in a normal survey setting. This is 
the standard evaluation methodology when the true
parameter value can be uniquely defined. Gold standard 
methods are seldom error-free but they can to varying 
extents provide better estimates, and the difference between 
the regular estimate and the gold standard estimate can serve 
as an estimate of the bias, which is the methodology used in 
census post enumeration surveys (United Nations 2010). 
Often an evaluation concerns a specific error component 
such as census undercount, nonresponse bias, interviewer 
variance or simple response variance, since we want 
information not on total survey error per se but rather on the 
components’ relative contribution to the total survey error so 
that root causes of problems can be identified and relevant 
processes improved. Large evaluation studies are very rare 
since they are so demanding and their value is sometimes 
questioned (United Nations 2010). Smaller regular 
evaluation studies, on the other hand, are necessary to get 
indications of process and methodological problems. 
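
The logic of the gold standard comparison can be written out as follows. The notation is ours and introduced only for illustration: $\hat{\theta}$ is the regular survey estimate, $\theta$ the true parameter value, and $\hat{\theta}_{GS}$ the estimate obtained with the gold standard methodology.

```latex
\begin{align}
  \mathrm{MSE}(\hat{\theta})
    &= E\bigl[(\hat{\theta} - \theta)^2\bigr]
     = \mathrm{Var}(\hat{\theta}) + B(\hat{\theta})^2 ,
  \qquad B(\hat{\theta}) = E(\hat{\theta}) - \theta , \\
  \widehat{B}(\hat{\theta})
    &= \hat{\theta} - \hat{\theta}_{GS} .
\end{align}
```

Since $\theta$ is unknown, the second line substitutes $\hat{\theta}_{GS}$ for it. The resulting bias estimate is therefore only as good as the gold standard, which, as noted above, is seldom error-free.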


5.2 Indicators of quality 


Continuing reporting of total survey error is a formidable 
task and no survey organization does that. Instead 
organizations provide indicators or statements regarding 
quality. For instance, according to Eurostat’s (2009a) 
handbook for quality reports the following indicators should 
be measured: 


Coefficient of variation;

Overcoverage rate;

Edit failure rate;

Unit response rate;

Item response rates;

Imputation rates;

Number of mistakes;

Average size of revisions.


The common theme here is that these paradata summary 
items are indicators that can be calculated without 
conducting special studies. The set of indicators that can be 
calculated directly from the survey data is by definition 
quite limited and their value questionable. For instance, to 
include overcoverage but not undercoverage just because 
only the former can be calculated directly from the available 
data does not make sense. It is undercoverage that poses the 
greatest coverage problem in surveys. Admittedly, the 
handbook instructs the producer to assess the potential for
bias (both sign and magnitude) but it is not clear how this
should be accomplished. The producer is also urged to include
evaluation and quality control results, if such information
exists. Level of effort measures for processes such as
questionnaire design and coder training would be welcomed.
There is no standard reporting format for such
qualitative and quantitative information. In any case, the key
indicator list becomes severely limited when compared to
the full list of main error sources and it is hard to see how 
they are perceived by the users and how they can be used by 
the producer to improve the process. 
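
To make concrete how little is needed to compute indicators of this kind, the sketch below derives a few of them from simple process counts. The definitions are plain textbook ratios chosen for illustration; they are assumptions and not the exact definitions in the Eurostat handbook. All names (`ProcessCounts`, `indicator_report`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessCounts:
    """Counts that are typically available without any special study."""
    eligible_units: int        # eligible sample units
    responding_units: int      # units that responded
    item_expected: int         # values expected for one survey item
    item_answered: int         # values actually reported for that item
    item_imputed: int          # values filled in by imputation
    records_edited: int        # records passed through the edit checks
    records_failing_edit: int  # records failing at least one edit


def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """Standard error relative to the size of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return standard_error / abs(estimate)


def indicator_report(c: ProcessCounts, estimate: float, standard_error: float) -> dict:
    """Collect the directly computable indicators in one dictionary."""
    return {
        "coefficient_of_variation": coefficient_of_variation(estimate, standard_error),
        "unit_response_rate": c.responding_units / c.eligible_units,
        "item_response_rate": c.item_answered / c.item_expected,
        "imputation_rate": c.item_imputed / c.item_expected,
        "edit_failure_rate": c.records_failing_edit / c.records_edited,
    }
```

Note that none of these quantities says anything about undercoverage or measurement bias, which illustrates the limitation discussed above.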

The producer needs a more complete list of indicators to 
be able to measure or assess various levels of quality to 
make sure that the design implementation is in control or to 
be able to mount a quality improvement project. The initial 
survey design must be modified or adapted during the 
implementation to control costs and maximize quality. 
Biemer (2010) discusses four strategies for reducing costs
and errors in real time, i.e., continuous quality improvement
(CQI), responsive design (Groves and Heeringa 2006), Six
Sigma (Breyfogle 2003), and adaptive total design and
implementation.

When the continuous quality improvement strategy is 
used, key process variables are identified and so are process 
characteristics that are critical to quality (CTQ). For each 
CTQ, real-time, reliable metrics for the cost and quality are 
developed. The metrics are continuously monitored during 
the process and intervention is done to ensure that costs and 
quality are within acceptable limits. The responsive design
strategy was developed to reduce nonresponse bias in
face-to-face interviewing. It includes three phases. In the
experimental phase a few design options are tested (e.g., 
regarding incentive level). In the main data collection phase 
the option chosen in the experimental phase is implemented 
and the implementation continues until phase capacity is 
reached. In the nonresponse follow-up phase special 
methods are implemented to reduce nonresponse bias and 
control the data collection costs. Such methods include the 
Hansen-Hurwitz double sampling scheme, increased 
incentives, and using more experienced interviewers. Again 
the efforts continue until further reductions of the 
nonresponse bias are no longer cost-effective. Six Sigma is 
the most developed business excellence model since it relies 
so heavily on statistical methods. It contains a large set of 
techniques and tools that can be used to control and improve 
processes. Adaptive total design and implementation 
combines control features of CQI, responsive design and 
Six Sigma and does that so that it simultaneously monitors 
multiple error sources. Biemer and Lyberg (2012) give 
several examples of CTQs and metrics for various survey 
processes. For instance, regarding the measurement process 
attributes that are CTQs might include the abilities to 
identify and repair problematic survey questions, to detect 
and control response errors, and to minimize interviewer 
biases and variances. Corresponding metrics might include 
missing data item by question, refusal rate by size of 
business, results of replicate measurements, suspicious edits 
actually changed, and field work results by interviewer. The 
metrics can be analyzed using statistical process control or 
analysis-of-variance methodologies. Different related
metrics can be displayed together in a dashboard fashion.
For instance, if one CTQ is the ability to discover interviewer
cheating we might want to have a dashboard
showing the metrics average interview length by interviewer
and the distribution of some sensitive sample characteristic,
also by interviewer.
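
As an illustration of how an interviewer-level metric might be monitored with statistical process control, the sketch below applies conventional p-chart limits to a proportion such as the share of items left missing by each interviewer. The choice of metric, the three-sigma limits and the function name `p_chart_flags` are assumptions made for the example, not a procedure taken from the sources cited above.

```python
import math


def p_chart_flags(counts: dict, sigma: float = 3.0):
    """Flag interviewers whose rate falls outside p-chart control limits.

    counts maps an interviewer id to (events, trials), for example
    (items left missing, items administered).
    Returns the pooled rate and a dict of id -> True if out of control.
    """
    total_events = sum(events for events, _ in counts.values())
    total_trials = sum(trials for _, trials in counts.values())
    p_bar = total_events / total_trials  # centre line: pooled rate

    flags = {}
    for interviewer, (events, trials) in counts.items():
        # Limits depend on each interviewer's own workload (trials).
        half_width = sigma * math.sqrt(p_bar * (1.0 - p_bar) / trials)
        lower = max(0.0, p_bar - half_width)
        upper = min(1.0, p_bar + half_width)
        rate = events / trials
        flags[interviewer] = not (lower <= rate <= upper)
    return p_bar, flags
```

An interviewer flagged in this way is a candidate for follow-up rather than proof of a problem. Because the pooled rate includes the flagged interviewers themselves, one would normally recompute the limits after the special causes have been investigated.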


5.3 Self-assessments and audits 


The quality management philosophy has introduced the 
concepts of self-assessment and audit into statistics 
production. We are anxious to know what users, clients, 
owners and other stakeholders think about the products and 
services provided by the statistical organization. There are a 
number of tools available for this kind of evaluation. We 
have already mentioned the customer satisfaction survey. 
Other tools include employee surveys, internal audits and 
external audits. Customer surveys can shed light on what 
users think about products and services provided. They can 
be used to determine user needs and to identify what 
product characteristics really matter to the users. Another 
line of questioning might concern the image of the 
organization and how it compares to the images of other 
organizations, be they competitors or not. The customer 
satisfaction survey is very common in society. Often it
cannot be used to make inference to the target population of
users due to its methodological and conceptual shortcomings.
The abundance of satisfaction surveys in society,
developed and implemented by people with no formal 
training in survey methods, contributes to lukewarm 
receptions in more serious settings resulting in nonresponse 
and measurement errors. For instance, the 2007 Eurostat 
User Satisfaction Survey consisted of two separate surveys. 
One was launched on the Eurostat webpage and the target 
population consisted of 3,800 registered users. Only those 
registered users that entered the website during the data 
collection period were exposed to the survey request and 
this led to a response rate around 5%. The second survey 
used email that was sent to a number of main users 
identified by Eurostat. This more controlled environment 
generated a response rate of 28%. These surveys also have 
problems identifying the most suitable respondent. If the 
“wrong” respondent is chosen within an organization this 
will most certainly lead to uninformed and misleading 
results. 

The simplest type of self-assessment is the questionnaire 
or checklist that is filled out by the survey manager. An 
example is one from Statistics New Zealand. It is a checklist 
that consists of a number of indicators or assertions such as 
“information needs are regularly assessed through user
consultation”, “good and accessible documentation”,
“indicators of accuracy regularly produced and monitored”,
and “presentation standards met”. The manager is asked to
answer yes or no to each assertion and make a comment if 
deemed necessary. Statistics Sweden had a similar system in 
place where one of the questions was “has overall quality of 
your product improved, declined or stayed the same 
compared to last year?” When results were compiled for 
these three categories for the entire organization, a very 
small proportion of the managers reported declining quality, 
a somewhat larger proportion reported improved quality, 
while a vast proportion reported status quo. The managers 
simply did not have the proper means to assess overall 
quality. Furthermore, vague quantifiers like “regularly”,
“good”, and “meeting standards” invite generous assessments.
Also, most managers do not want to look bad and
status quo becomes a perfect escape route. This system of
self-assessment was eventually abandoned by Statistics 
Sweden. It is possible to increase the value of these 
assessments by asking additional questions concerning 
details about how and when quality work was conducted. 
Some organizations use internal teams that audit important 
products. Julien and Royce (2007) describe a quality audit 
of nine products at Statistics Canada, where the purposes 
were to identify weaknesses and their root causes as well as 
identifying best practices. Review teams of assistant 
managers were formed so that each reviewer reviewed three 
different programs. The main weakness with an approach 
like this is the internal feature itself. Every reviewer knows 
that sooner or later it is his or her turn to be reviewed and 
there is a risk that this fact might hold them back. It is also 
internal in the sense that users are not explicitly present in 
the review process. In its general audit program on data 
quality management, however, Statistics Canada puts great 
emphasis on its user liaison system (Julien and Born 2006), 
which is one of the five systems forming the agency’s 
quality assurance framework, the others being corporate 
planning, methods and standards, dissemination, and 
program reporting. 

A further variant of self-assessment is when it precedes 
an external audit. Statistics Netherlands (1997) describes 
how the Department of Statistical Methods is assessed by its 
staff. The assessment resulted in a listing of weak and strong 
areas that were later examined by an external team. 
Typically an external audit uses some kind of benchmark 
like a set of rules, a standard, or a code of practice for 
assessment purposes. The audit then results in a number of 
recommendations for the organization or the individual 
product or service. 

Recently a general system for evaluating the total survey 
error has been developed at Statistics Sweden. Sweden’s 
Ministry of Finance wants quality evaluation results to be 
able to monitor quality improvements over time. Survey 
quality must be assessed for many surveys, administrative
registers, and other programs within the agency so there is 
need for some indicators that can serve as proxies for actual 
measures of quality. At the same time, the assessment 
process must be thorough, the reporting simple and the 
results credible. For each of the error sources (specification,
frame, nonresponse, measurement, data processing, sampling,
model/estimation, and revision), eight key products
were rated poor, fair, good, very good, or excellent
regarding each of five criteria. The criteria were knowledge
of risks, communication with users, compliance with 
standards and best practices, available expertise, and 
achievement toward risk mitigation and/or improvement 
plans. The rating guidelines varied by criterion. For 
knowledge of risks they were: 


An example of the rating guidelines: Knowledge of risks


Poor: Internal program documentation does not acknowledge
the source of error as a potential factor for product
accuracy.

Fair: Internal program documentation acknowledges error
source as a potential factor in data quality. But: little
work has been done to assess these risks.

Good: Some work has been done to assess the potential impact
of the error source on data quality. But: Evaluations have
only considered proxy measures (for example, error rates) of
the impact with no evaluations of MSE components.

Very Good: Studies have estimated relevant bias and variance
components associated with the error source and are
well-documented. But: Studies have not explored the
implications of the errors on various types of data analysis
including subgroup, trend, and multivariate analyses.

Excellent: There is an ongoing program of research to
evaluate all the relevant MSE components associated with
the error source and their implications for data analysis.
The program is well-designed and appropriately focused, and
provides the information required to address the risks from
this error source.


The evaluation process started with a self-assessment 
done by each of the eight key products. These reports and 
other relevant documents were studied by two external 
reviewers who then met with product owners and their staff 
to discuss the product processes. After that the reviewers 
presented detailed assessments and scored each product. 
The procedure identified important areas to improve within 
but also across products. In this first evaluation round 
measurement error turned out to be a problematic area for 
almost all the key products. Like any other approach to
measuring or indicating total survey error, this one does not
really reflect total mean squared error. It requires thorough
documentation of processes and improvements made and it 
is highly dependent on the skills and knowledge of the 
external reviewers. This study is reported in Biemer, 
Trewin, Japec, Bergdahl and Pettersson (2012). 


5.4 Quality profiles 


In continuing surveys there is an opportunity to develop 
quality profiles. Such documents contain all that is known 
about the quality of a continuing survey or other statistical 
product assembled over a number of years. Quality profiles 
exist for only a few major surveys, all, except one, 
conducted in the U.S., including the Current Population 
Survey (Brooks and Bailar 1978), the Survey of Income and 
Program Participation (Jabine, King and Petroni 1990; 
Kalton, Winglee and Jabine 1998), the Schools and Staffing 
Survey (Kalton, Winglee, Krawchuk and Levine 2000), and 
the American Housing Survey (Chakrabarty and Torres 
1996). The exception is the British Household Panel Survey 
(Lynn 2003). The main problem with a quality profile is that
it is not timely, since it compiles results from often time-consuming
studies of quality. The goal of the quality profile
is to identify areas where knowledge about errors is 
deficient so that improvements can be made. Kasprzyk and 
Kalton (2001) and Doyle and Clark (2001) review the use of 
quality profiles in the U.S. 


6. Where do we go from here? 


Quality management ideas have been influential in many 
survey organizations. Concepts such as leadership, quality 
culture, problem prevention, customer, competition, risk
assessment, process thinking, improvement, business excellence,
and business architecture are increasingly discussed
by leaders of survey organizations, e.g., Trewin (2001), Pink 
(2010), Fellegi (1996), Brackstone (1999), de Vries (1999), 
Groves (2011), and Bohata (2011). It seems as if the survey 
community is moving in a direction where statistics 
production becomes more streamlined and cost-effective but 
the pace is slow. Some organizations have started using a 
quality management model for self-assessment and steering 
purposes. EFQM is the recommended model for national 
statistical institutes within the European Statistical System 
and a couple of institutes, the Czech Republic and Finland, 
have even applied for their respective national EFQM 
awards. Some marketing firms are certified according to the 
ISO 9001 quality management standard and others are 
certified according to the ISO 20252 standard for market, 
opinion, and social research. This development ought to 
result in quality improvements but we cannot be really sure 
until we start collecting relevant data. One thing is sure,
though. Some customers prefer service providers that are 
certified, have won awards or can show evidence that they 
are working according to some quality framework or model. 
Very few customers would think that this is a negative 
thing. 

The margins of error that we associate with estimates are 
usually too short, since they do not include all sources of 
variation. Point estimates can be off due to biases. Ideally it 
would be good if we were able to produce estimates of the 
total survey error instead of what we produce today. Such a 
development is, however, not realistic. We are not in a 
position to produce such estimates, not even occasionally, 
for reasons that have to do with finances, timing and 
methodology. That leaves us with indicators of total survey 
error and its components. Such indicators are of limited 
value to the users. Users simply do not know what to do 
with information on nonresponse rates, response variance 
measured by reinterviews or edit failure rates. On the other 
hand, such indicators are very useful to the producers of 
surveys. For instance, reinterview studies can identify 
fabrication and survey questions with poor response 
consistency. A majority of users appreciate the service 
provider’s credibility and part of the credibility is the ability 
to present accurate data. Another important part of 
credibility is the willingness of the providers to evaluate 
their own quality and to report the results of such 
evaluations. Even if these evaluations show problems, it is 
better for the provider to find the problems than if entities 
outside the provider’s organization find them. Most users do 
not want to become involved in discussions about errors and 
trade-offs between errors and for good reasons. It is simply 
too technical and confusing. If we accept that a good 
process quality is a prerequisite for a good product quality, 
we should gradually improve the processes so that they 
approach ideal bias-free ones. In that way the variance of an 
estimate becomes a good approximation of the mean 
squared error. 
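
The closing argument rests on the standard decomposition of the mean squared error into variance plus squared bias. The following is a minimal sketch of that relationship; the function names and the numbers are illustrative assumptions, not figures from any survey discussed here.

```python
def mean_squared_error(variance: float, bias: float) -> float:
    """Mean squared error of an estimator: variance plus squared bias."""
    return variance + bias ** 2


def variance_share(variance: float, bias: float) -> float:
    """Fraction of the mean squared error accounted for by the variance alone."""
    return variance / mean_squared_error(variance, bias)


if __name__ == "__main__":
    variance = 4.0  # hypothetical sampling variance
    # Hypothetical biases, shrinking as processes improve.
    for bias in (3.0, 1.0, 0.1, 0.0):
        mse = mean_squared_error(variance, bias)
        print(f"bias={bias}: MSE={mse:.2f}, variance share={variance_share(variance, bias):.3f}")
```

As the bias approaches zero, the variance share approaches one, which is the sense in which the variance becomes a good approximation of the mean squared error.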

Despite endless discussions and a myriad of survey 
quality initiatives, practices have not changed much (Lynn 
2004; Pink, Borowik and Lee 2010; Groves 2011; Bohata 
2011). Perhaps the lack of competence within survey 
organizations is one root cause of the slow pace. Many 
theories and methodologies including statistics, IT,
management, communication, and behavioral sciences are 
needed in survey research. The behavioral sciences are 
needed to identify the root causes of nonsampling errors. If 
errors are just quantified no improvement can happen. 
Current training programs emphasize sampling, non- 
response, coverage and estimation in the presence of these. 
Other processes and error sources such as measurement and 
data processing are not dealt with to the same extent. This 
leads to a situation where studies on measurement error and
data processing error are rare compared to studies on, say, 
nonresponse. There is considerable confusion regarding
concepts and methods in both the producer and the user 
camps. Another cause of slow pace might be the consensus 
philosophy that rules in some organizations when it comes 
to decision-making regarding changes. This philosophy is 
one of compromise. Input from many stakeholders is 
gathered and a decision is usually based on the smallest 
common denominator, which is never a good standard. 
Furthermore, arriving at this compromise usually takes a 
long time and lots of resources. This approach is very far 
from Plan-Do-Check-Act. 

Survey quality is not an absolute entity. Current quality 
reporting a la one-size-fits-all is not working since fitness 
for use is defined by each user. Quality dimensions such as 
timeliness, comparability and accessibility should be 
decided together with main users while best possible 
accuracy given various constraints is the responsibility of 
the service provider. 

Have the survey quality discussion and the adoption of 
quality management strategies resulted in better data? We 
do not know. Survey quality has not been assessed in a 
before-after fashion. There is a tendency towards greater 
standardization and centralization, which should prove cost- 
efficient but when it comes to data quality some indicators 
point in the wrong direction. For instance, in many countries 
nonresponse rates are increasing and error properties of 
mixed-mode, translation of survey materials, and other 
design features are not fully known or are different across 
cultures. There is no design formula, which results in shaky 
trade-off decisions and problems deciding about intensities 
with which quality control should be applied. There is a 
persistent quest for best practices in survey organizations 
but implementation is difficult and scattered. There is 
definitely a great need for an upgrade in the competence 
level across the board. A structured international
competence development program for service providers is 
necessary as is a systematic international collaboration on 
how to best design and implement surveys. We must serve 
our users better by providing data with small errors. We can 
do this by better combining our knowledge about statistics 
and cognitive phenomena with the principles of quality 
management. The most positive sign is the overwhelmingly
positive attitude toward quality improvement among
statistical organizations around the world.


Data collection: Experiences and
lessons learned by asking sensitive questions 
in a remote coca growing region in Peru 


Jaqueline Garcia-Yi and Ulrike Grote


Abstract 


Coca is a native bush from the Amazon rainforest from which cocaine, an illegal alkaloid, is extracted. Asking farmers 
about the extent of their coca cultivation areas is considered a sensitive question in remote coca growing regions in Peru. As 
a consequence, farmers tend not to participate in surveys, do not respond to the sensitive question(s), or underreport their 
individual coca cultivation areas. There is a political and policy concern in accurately and reliably measuring coca growing 
areas, therefore survey methodologists need to determine how to encourage response and truthful reporting of sensitive 
questions related to coca growing. Specific survey strategies applied in our case study included establishment of trust with 
farmers, confidentiality assurance, matching interviewer-respondent characteristics, changing the format of the sensitive 
question(s), and non enforcement of absolute isolation of respondents during the survey. The survey results were validated 
using satellite data. They suggest that farmers tend to underreport their coca areas to 35 to 40% of their true extent. 


Key Words: Coca; Cocaine; Sensitive question; Misreporting; Nonresponse; Peru. 


1. Introduction 


Over the last 30 years, surveys have been increasingly 
used to explore sensitive topics (Tourangeau and Yan 
2007). For example, data obtained from surveys have been 
used to investigate “socially undesirable” behaviors, such as 
the prevalence of illicit drug use (e.g., Botvin, Griffin, Diaz, 
Scheier, Williams and Epstein 2000; Fergusson, Boden and 
Horwood 2008), illegal abortion (e.g., Johnson-Hanks 2002; 
Varkey, Balakrishna, Prasad, Abraham and Joseph 2000), or 
alcohol consumption among adolescents (e.g., Strunin 2001; 
Zufferey, Michaud, Jeannin, Berchtold, Chossis, van Melle 
and Suris 2007). Such surveys have been commonly utilized 
in academic research and policy analysis (Davis, Thake, and 
Vilhena 2009), even though asking sensitive questions has 
generally been seen as problematic. The responses have 
been considered to be prone to error and bias because 
respondents consistently underreport socially undesirable 
behaviors (Barnett 1998; Tourangeau and Yan 2007). Low 
response rates have been an additional concern. Those who 
are selected for a survey can simply refuse to take part in the 
survey or they can participate but refuse to answer the 
sensitive questions (Tourangeau and Yan 2007). 

Recent surveys at the household level have incorporated 
sensitive questions related to the extent of coca growing 
areas (see e.g., Ibanez and Carlsson 2010). Coca is a native 
bush from the Amazon rainforest in South America from the 
leaves of which cocaine is extracted. Colombia’s coca bush 
area represents 40%, Peru’s 40%, and Bolivia’s 20% of the 
total area under coca cultivation worldwide, amounting to
154,100 hectares (UNODC 2011). In Peru and Bolivia, the
leaves of this bush have been traditionally used for many 
purposes from around 3000 B.C. (Rivera, Aufderheide, 
Cartmell, Torres and Langsjoen 2005) until today. Those 
traditional uses mainly include coca chewing and coca tea 
drinking to overcome fatigue, hunger and thirst; and to 
relieve “altitude sickness” and stomach ache symptoms, 
respectively (Rospigliosi 2004). Since the 1970s, however, 
coca cultivation skyrocketed because of its use as the raw 
material for the production of cocaine (Caulkins, Reuter, 
Iguchi and Chiesa 2005). The cocaine content of the coca 
leaves is below 1%, and ranges from 0.13 to 0.86% 
(Holmstedt, Jaatmaa, Leander and Plowman 1977). There- 
fore narcotics traffickers need large quantities of coca leaves 
to obtain enough of the alkaloid for commercialization in 
the illegal market. In general, growing coca for the narcotics 
trafficking business is a profitable activity. In fact, the in- 
come of a coca growing farmer has been calculated to be 
54% higher than the income of a non coca growing farmer 
(Davalos, Bejarano and Correa 2008). 

Consequently, coca-related research has become oriented 
towards evaluating the profitability of coca versus other 
cash crops (see, e.g., Gibson and Godoy 1993; Torrico, 
Pohlan and Janssens 2005). Different attempts were made to 
replace coca by other crops, but it has been generally estab- 
lished that crop substitution as an anti-drug policy has been 
a failure (UNODC 2001). Decision makers and researchers 
have recognized that there are relevant socio-economic de- 
terminants that lead to coca growing other than economic 
profitability. These include social capital (Thoumi 2003), 
saving account functions and financial reserve for large
expenses (Bedoya 2003; Mansfield 2006). Comprehensive
databases which include specific household-level informa- 
tion for coca growing areas are required to test those latter 
hypotheses. 

Coca growing is not illegal per se in Peru, which partly
reflects the social acceptance of traditional uses of coca in
this country (UNODC 2001). During the 1990s, the primary
focus of the Peruvian Government was on “pacifying” the
country by bringing terrorist groups under control. The
Peruvian Government implemented what is currently known
as the “Fujimori Doctrine”. The idea underlying this Doctrine
was that coca cultivation was not criminal in nature, but
attributable to poverty. Consequently, the Fujimori Doctrine
decriminalized all coca farmers, which diminished the
farmers’ need for protection from terrorist associations and
therefore made it easier for the Government to fight those
violent groups (Obando 2006). Thus, the current
legal framework seems to facilitate narcotics trafficking be- 
cause coca used in illegal trade can be cultivated under the 
guise of traditional uses (INCB 2009; Durand 2005). Ac- 
cordingly, Garcia and Antezana (2009) suggest that some 
farmers sell coca to those who purport to be traditional-use 
traders, but are actually narcotics traffickers who process 
coca leaves in different places, such as small towns at the 
border with Bolivia. 

Even though coca farming is not illegal, coca-growing 
regions which are perceived to be supplying narcotics traf- 
fickers (e.g., regions with large coca fields) can be targeted 
by the Government for the implementation of forced erad- 
ication programs (Obando 2006). After eradication, coca 
growers are likely to incur large economic losses, depending 
on the total extent of their individual coca cultivation areas. 
Thus, some of the farmers might be reluctant to provide 
information on whether or not they have any coca under 
cultivation. It should also be expected that some of the 
farmers who admit to cultivating coca, would not report the 
true extent of the area, given their fear that large coca fields 
could be more prone to eradication. 

Since there are both political and policy concerns in 
accurately and reliably measuring coca growing areas, it is 
necessary for survey methodologists to determine how to 
encourage response and truthful reporting of answers to 
sensitive questions related to coca growing. This article 
suggests and evaluates a number of strategies to increase 
both the reporting and the reliability of household-level
responses in a remote coca growing region in Peru. 

Although the topic of this article is specifically related to 
coca growing, the lessons learned about survey design and 
implementation could be used as a reference for dealing 
with other sensitive topics such as health-related issues 
(e.g., contraception and sexual behavior) or undesirable
behaviors (e.g., illegal drug use) in other regions in different
countries. 

The structure of the article is as follows: Section 2 
describes the community in Peru subject to study, the spe- 
cific strategies to reduce non-response and misreporting as 
well as the lessons learned from data collection related to 
sensitive questions in the research area. Section 3 presents 
the coca growing-related survey results and their validation, 
while Section 4 is comprised of a summary of the main 
results followed by the conclusion. 


2. Data collection in a coca-growing 
community in rural Peru 


This section describes the coca-growing community, and 
the primary data collection strategies applied in our study 
and the lessons learned. 


2.1. Description of the research area 


The research area was located in the Upper Tambopata 
valley at the border with Bolivia, one of the most remote 
and difficult to access Amazon rainforest areas in Peru 
(UNODC Office in Peru 1999). This valley lies in the 
Vilcabamba-Amboro Biodiversity Corridor in close prox- 
imity to national protected areas (see Figure 1). The entire 
population of the upper Tambopata valley is composed of 
immigrants, especially descendants from the Aymara 
indigenous population. Aymara is a native ethnic group 
originally from the Andes and Altiplano regions of South 
America. During the 1950s, most of the farmers were 
seasonal immigrants who left their Altiplano subsistence 
plots for only three to six months every year, and made the 
320 km journey to the upper Tambopata valley to cultivate 
coffee on their individually owned agricultural plots 
(Collins 1984). Over time, most farmers became permanent 
settlers in the upper Tambopata valley, and cultivate coffee 
as their main cash crop (ibid). 

Before 1989, coca cultivation in the upper Tambopata 
valley was very minor. Small-scale coca production was 
limited to self-consumption or local markets for traditional 
uses such as coca chewing by Andean farmers and miners. 
After 1989, coca cultivation was intensified, primarily in the 
neighboring upper Inambari valley. The change did not 
appear to be in response to increases in local demand or 
external demand by traditional users (UNODC Office in 
Peru 1999). Coca from those valleys is considered to be of low
quality due to its bitterness, and it is in less demand for
traditional chewing than coca from the Cuzco region (Caballero,
Dietz, Taboada and Anduaga 1998). Those increases were
therefore related to demand from narcotics traffickers. In recent
years, large increases in coca cultivation in the upper 
Tambopata valley have been consistently reported by the 
United Nations (UN), as observed in Table 1. The
percentage variation per year in the upper Tambopata valley is
above the annual change of around 4% at national level. 


Table 1 


Coca cultivation in the upper Tambopata Valley (2005-2008)* 


Year    Hectares    Percentage of variation in relation to previous year
2005      253           -
2006      377          49.0
2007      863         128.9
2008      940           8.9


*Since 2009 coca areas from the upper Tambopata valley are 
aggregated with coca areas from Inambari valley in UNODC 
reports. Therefore, it is not possible to estimate the percentage of 
variation in relation to previous year only for Tambopata valley 
during later years. 


Source: Own calculation using data from UNODC (2009). 
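
The percentage column of Table 1 can be checked against the hectare column. The sketch below is only an illustration (the function names are assumptions, not part of the original calculation): it recomputes the 2008 variation from the 2007 and 2008 areas, and works backwards from the 2007 area and the printed percentages to the areas those percentages imply for 2006 and 2005.

```python
def percent_change(previous: float, current: float) -> float:
    """Percentage of variation of `current` in relation to `previous`."""
    return (current - previous) / previous * 100.0


def previous_from_change(current: float, pct: float) -> float:
    """Area in the previous year implied by the current area and its percentage change."""
    return current / (1.0 + pct / 100.0)


if __name__ == "__main__":
    # 2007 and 2008 areas in hectares, as printed in Table 1.
    print(round(percent_change(863, 940), 1))

    # Working backwards from 2007 with the printed percentages.
    area_2006 = previous_from_change(863, 128.9)
    area_2005 = previous_from_change(area_2006, 49.0)
    print(round(area_2006), round(area_2005))
```

Because the printed percentages are rounded to one decimal, the back-calculated areas are approximate before rounding to whole hectares.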

Coca provided by the upper Tambopata valley and upper
Inambari valley seems to mainly supply cross border trade 
associations between Peruvian and Bolivian narcotics traf- 
fickers. Bolivia remains the world's third largest producer of 
cocaine, and it is a significant transit zone for cocaine of 
Peruvian-origin (U.S. Department of State 2009). Those 
valleys constitute a strategic coca production area for nar- 
cotics traffickers due to their proximity to an external exit 
route (UNODC Office in Peru 1999). Coca leaves are not 
always transformed into cocaine in the agricultural plots. 
Narcotics traffickers seem to take advantage of the large 
quantities of coca leaves transported to urban areas, osten- 
sibly for traditional user markets. This coca is then purchased 
and processed at hidden facilities in urban areas near the 
Bolivian border. In this way the risk of being caught by 
authorities is reduced. From Bolivia the cocaine is dispatched 
to Brazil and Europe (Garcia and Antezana 2009). 


Figure 1 Map of the research area. Map legend: Altiplano area;
Upper Tambopata Valley; immigration route; Bahuaja Sonene
National Park; other protected areas; Vilcabamba-Amboro
Biodiversity Corridor; Titicaca Lake. Map labels: Brazil;
Pacific Ocean.

Source: Own elaboration


Coca cultivation does not necessarily translate into better
quality of life for the farmers in South America (Davalos,
et al. 2008). According to the last population census, the
living conditions in San Pedro de Putina Punco (SPPP), the 
district located in the heart of the Upper Tambopata valley, 
are difficult: 72% of the houses are rammed earth construc- 
tions, 88% have dirt floors, 16% have public electricity, 
12% have public water, and only 9% have access to public 
sewage (INEI 2007). This situation is common in the major 
coca growing areas in Peru, where 70% of the inhabitants 
continue to live in poverty, and 42% in extreme poverty 
(Commission on Narcotic Drugs 2005). 


2.2 Data collection strategies and lessons learned 


A feasibility study to test if farmers would answer coca- 
related questions was conducted in December 2007. The 
pilot study for the designed questionnaire took place in May 
2008, and the final survey was conducted between June and 
August 2008. The feasibility and pilot studies and the final 
survey were focused on the farmers located in San Pedro de 
Putina Punco (SPPP), a district in the upper Tambopata 
valley which is located in the deepest rainforest. All the 
farmers in the research area produce coffee as cash crop and 
some supplement their income with coca cultivation. There 
are five coffee co-operatives in SPPP. Farmers have to 
become a member of one of these co-operatives in order to 
be able to sell their coffee, because restrictions to coffee 
intermediaries are in place. The final survey was only con- 
ducted among the members of four of these co-operatives 
because most of the members of the remaining co-operative 
are based in San Juan del Oro, a district outside the research 
area. 

The final survey consisted of a structured questionnaire 
which focused on agricultural production and social capital. 
The questionnaire was comprised of 15 sections: 


1. General information about the farmer and household
2. General information about the agricultural plot and coffee area
3. Additional economic activities
4. Organic certification information
5. Cognitive social capital and identity
6. Information and communication
7. Personal aspirations and risk attitudes
8. Structural social capital
9. Covariant and idiosyncratic shocks
10. Human capital
11. Social networks
12. Coca use traditions
13. Detailed agricultural production costs
14. Labor access
15. Additional questions

The survey items related to the sensitive question are
presented in Appendix 1.

Asking farmers about their coca growing area is a 
sensitive question. Farmers who cultivate large areas of coca 
fear that the information provided could be accessed by 
authorities responsible for eradication programs. Thus, they 
might have concerns about the possible consequences of 
giving a truthful answer should the information become 
known to a third party. In these cases, the farmers need to be 
assured anonymity. Farmers could also be tempted to pro- 
vide socially desirable answers to the interviewers. Coca has 
become an important focal symbol in the indigenous popu- 
lation’s struggle for self-determination (Office of Technol- 
ogy Assessment 1993). Coca “yes”, cocaine “no” constitutes 
the slogan of indigenous people (Henman 1990); the formulation
tries to clearly separate traditional uses (“coca”) from
narcotics trafficking (“cocaine”). Hence, traditional uses 
such as coca chewing are ethnicity symbols (Allen 1981) 
and their persistence could be related to feelings of nation- 
alism in Peru (Henman 1990). In this sense, it could be 
expected that farmers would not find it very problematic to 
indicate that they grow coca, as long as they can associate it 
with traditional uses. On the other hand, due to the asso- 
ciation of larger production areas with illegal activities, coca 
growers may underreport the total extent of their coca 
production areas in an attempt to give the impression that 
they are growing only for traditional use. 

Several strategies can help to reduce the potential biases 
associated with question sensitivity, item and unit nonre- 
sponse and deliberate misreporting. These strategies in- 
clude: confidentiality assurances; careful selection of the 
data collection mode and setting of the sensitive question 
format; and tailoring interviewer characteristics and behavior
(see Coutts and Jann 2008; Tourangeau and Yan 2007).
Further information on the implementation of these strate- 
gies in our case study is provided below. 


Establishing trust, and anonymity assurances 


Farmers in coca growing areas tend to distrust external 
people. In this particular area, we found out that they trust 
the coffee co-operative directors. One of the directors of the 
coffee co-operatives signed a letter of presentation autho- 
rizing our research related to agricultural cultivation. The 
letter was shown to the farmers prior to conducting the 
survey. A pilot test conducted with and without the 
presentation letter demonstrated that the letter was important 
to reduce survey participation refusals. In the survey intro- 
duction, it was also indicated by the interviewer that the co- 
operative director authorized the survey because the director 
expected the results to benefit co-operative members. In 
addition, farmers were clearly told at the beginning of the 
survey that the data collected would remain confidential, 
and the academic purpose of the questionnaire was 
highlighted (see Appendix 1a). This anonymity assurance
was short and precise in order to minimize suspicion among 
farmers as suggested by Singer, Hippler and Schwarz 
(1992). Coca growing was treated as a common and ordi- 
nary behavior in the research region, and a long and 
elaborate confidentiality assurance might have aroused 
farmers’ reservations instead of alleviating them. A brief 
reminder of the assurance of confidentiality was included in 
the middle of the questionnaire, before the questions related 
to traditional coca uses and prior to the sensitive question on 
the coca area. The reminder stated: “In this part of the 
survey, we will ask questions about coca uses and culti- 
vation. Please remember that the survey is anonymous and 
that there are no correct or incorrect answers” (See Ap- 
pendix 1b). This follows Willis (2005) who mentions that it 
is important to have warm-up questions and an announce- 
ment of the switching to the sensitive topic to reduce 
resistance to answer. 


Data collection mode 


Paper and pencil self-administration as data collection 
method was initially considered to try to reduce interviewer 
bias. However, during the feasibility study, it became evi- 
dent that many farmers, even those with above elementary 
school education (52% of the population; INEI 2007), were 
not able to read effortlessly. Farmers work in their fields 
almost all day long and do not have many opportunities to 
practice their reading skills. Similarly, audio computer-
assisted self-interviewing (ACASI), the method of choice for
collecting data on sensitive topics in developed countries 
(Mensch, Hewett and Erulkar 2003), was out of the scope of 
this project due to the lack of equipment and power supply, 
and the computer illiteracy in the research area. The use of 
computers was likely to have increased the anxiety and 
suspicion about the survey as described in the African 
situation by Mensch, et al. (2003). Therefore, a face-to-face
interview was the data collection mode selected and 
emphasis was placed on the selection of interviewers, their 
training and behavior. 


Selection of interviewers, training, and interviewers’ behavior


One problem with the selection of the interviewers was 
the lack of sufficiently educated professionals in the 
research area. Thus, a group of ten students from the nearest 
public university, located 16 hours away from the research 
area, was chosen as interviewers. All of the interviewers had 
Aymara or Quechua ethnic backgrounds; this was an attempt
to partially match interviewer-respondent characteristics.
It was thought that this could increase the likelihood
of participation because the matching was likely to
increase trust and sympathy between the interviewer and the 
respondent (Tourangeau and Yan 2007). The interviewers 
presented themselves as students from the local university,
and no additional information was given about any uni- 
versity or organization outside of the country financing the 
study to avoid potential misunderstandings and reduce 
distrust among the respondents. During the pilot study, 
some farmers had indicated concerns about externally 
financed coca eradication programs and therefore references 
to external institutions were minimized. As a result, only 
partial information was given to the respondents. This is 
unconventional, but under the specific circumstances of the 
study, there was no other alternative without facing potential 
security problems. 

For training, the interviewers first attended a two-day 
workshop in Puno city, followed by a three-day workshop 
in the research area. The same group of interviewers also 
conducted the pilot study to test the questions and question- 
naire with the objective of identifying comprehension, 
recall, judgement and acceptability issues in the survey, and 
allowing rephrasing, eliminating or adding questions. The 
pilot study also allowed assessment of the performance of 
the interviewers, and in some cases identified areas re- 
quiring tailored training based on the feedback on perfor- 
mance. For example, at the beginning one of the inter- 
viewers was hesitant about asking the coca-related question 
and that interviewer obtained a higher than average number 
of nonresponses to the sensitive question. After tailored 
training, the interviewer was able to modify their inter- 
viewing approach. 


Format of the sensitive question 


The question format presupposed the sensitive behavior 
under study, as suggested by Tourangeau and Yan (2007). 
Therefore, farmers were not first asked if they had any coca 
areas, and then asked for the total extent of their coca areas. 
Instead, all farmers were directly requested to state the total 
extent of their coca areas (“What is your coca growing area 
in meters or hectares?”). However, it was found during the
pilot study that the farmers did not feel comfortable with 
this question format and they either skipped the question or 
simply withdrew from the survey. As a consequence, the 
question format was changed and a forgiving wording was 
used instead. Farmers were asked: “How many ‘little bushes 
of coca’ do you have in your agricultural plot?” Thus, the 
farmer could answer “Only a little, I have... coca bushes”. 
Even though a difference was hardly perceptible, with the 
former question it was more difficult for the farmers to start 
their answers with “Only a little...”. So, using the latter 
question, it was easier for the farmers to add apologetic 
explanations to their answers making them feel more 
relaxed. This latter sensitive question format also had the 
advantage of employing a familiar wording for the Aymara,
who commonly use diminutives in their daily conversations.
On the other hand, this question format might indirectly
imply that the interviewer expected that the respondent had 
a small number of coca bushes likely resulting in under- 
reporting. Consequently, while nonresponses were avoided 
using this latter question format, underreporting was still 
expected to some extent. 


Time period for conducting the survey and data collection 
setting 


The farmers’ agricultural plots are scattered in the 
mountainous Amazon rainforest in Peru. It was difficult to 
reach individual farmers on their agricultural plots for the 
survey. Therefore, to conduct the survey, we mainly took 
advantage of the Saint Peter’s Day celebration and the 
General Assembly meetings of the co-operatives in June 
and August 2008, respectively, when the farmers congregated
in the town square. Attendance at the General
Assembly meetings is mandatory for all co-operative 
members so all of the targeted respondents would have been 
accessible at those events. The only way to reach or exit the 
town square is through an unpaved road. To take advantage 
of this, the survey was conducted in a large tent that was 
erected on the unpaved road on those key days. This tent 
had ten divisions, one for each pair of interviewer and 
respondent. Absolute privacy was not enforced because 
during the pilot study, it was found that farmers did not feel 
comfortable being the “only one” who was being inter- 
viewed; they preferred to see others doing the same. 
However, farmers were not able to overhear other farmers’ 
responses. Given that all farmers have to use the same 
unpaved road to reach the town square regardless of their 
specific geographic location, potential geographical biases, 
which in turn can be related to important variables such as 
farm size and income, were likely minimized in this 
research. 


Sampling representativeness 


A convenience sampling method was applied, but at the 
end of the survey, we asked the farmers for their co- 
operative registration number and used the co-operative 
registration lists to infer the sample’s representativeness. 
The co-operative registration number provided by the farmer
was written on a separate piece of paper and was not
attached to the respondent’s questionnaire. Respondents 
were informed about this procedure and were able to 
witness the procedure. 

The four co-operatives under study have 3,265 members 
in SPPP. Table 2 shows the number of respondents per co- 
operative. The number of collected questionnaires amounted 
to 508. In total, 12 respondents were excluded from the 
sample because their co-operative registration number was 
missing. In two cases, the farmers had refused to provide 
this information and in ten cases, the interviewers had 
forgotten to ask the respondents about their registration 
number at the end of the interview. Therefore the absence of
information was more associated with interviewer error than 
with the farmers’ unwillingness to provide this information. 


Table 2 
Number of respondents per co-operative 


                  Total Number of        Survey’s       Percentage of
                  Co-operative Members   Sample Size    Co-operative Members
                  in SPPP                               Interviewed (%)
Co-operative 1    756                    106            14
Co-operative 2    911                    138            15
Co-operative 3    887                    138            16
Co-operative 4    711                    114            16
Total             3,265                  496            15


Source: Own survey. 


In order to test for representativeness of the sample, the 
distribution of the co-operative registration numbers ob- 
tained from the survey sample was compared with the 
distribution of the co-operative registration numbers from a 
simulated simple random sample without replacement ob- 
tained from co-operative lists. The co-operative lists were 
ordered by the registration number of the co-operative 
members and co-operative registration numbers are asso- 
ciated with the members’ date of registration. Thus, most of 
the older farmers have lower registration numbers and the 
younger farmers have higher ones. Unfortunately, the co- 
operatives did not have other membership data available 
such as total land, coffee or coca hectares that might be used 
to select a stratified random sample. Two types of tests 
were used for comparison of the samples: a two-sample 
Wilcoxon rank-sum (Mann-Whitney) test and a two-sample 
Kolmogorov-Smirnov test for equality of distribution functions.
The first test assesses how probable it is that the two
groups come from the same distribution, under the null
hypothesis that the observed differences are caused by
chance fluctuation. The
second test is similar to the first one, but in addition it is 
sensitive to differences in both the location and shape of the 
empirical cumulative distribution functions of the two 
groups. The results of both tests failed to reject the null 
hypothesis of equality of distribution between the survey 
sample and the simulated simple random sample at a 
significance level of 0.05. Thus, the results suggest that the 
survey sample is equivalent to a simple random sample, and 
therefore representative of the population under study. 
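The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the registration numbers are hypothetical (1 to 911, the size of Co-operative 2), the "survey sample" is itself drawn at random as a stand-in for the real convenience sample, and the helper names `ks_statistic` and `ks_critical_value` are ours. Only the Kolmogorov-Smirnov part is shown, with the usual large-sample critical value for a 0.05 significance level.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


rng = random.Random(2008)
member_list = list(range(1, 912))               # hypothetical registration numbers
survey_sample = rng.sample(member_list, 138)    # stand-in for the observed sample
simulated_srs = rng.sample(member_list, 138)    # SRS without replacement from the list

d = ks_statistic(survey_sample, simulated_srs)
reject = d > ks_critical_value(len(survey_sample), len(simulated_srs))
print(f"D = {d:.3f}, reject equality at 0.05: {reject}")
```

With real data, `survey_sample` would hold the registration numbers collected from respondents. A failure to reject is weaker than proof of equivalence, which is why the conclusion is phrased as "suggest".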


3. Survey results and validation issues 


3.1 Survey results 


The survey response rate was around 90%, which is well 
above the minimum recommended response rate of 60% 
(Punch 2003). Of the 496 completed questionnaires, 19
respondents (less than 4%) did not answer the coca-related
question. When comparing the descriptive statistics of 
socio-economic, institutional, and coca-related variables, 
there were some significant differences between all the 
observations (without the non-respondents) and the ‘sen- 
sitive question non-respondents’ (see Appendix 2). The sen- 
sitive question non-respondents were all male, with a larger 
percentage of Aymara ethnic background, and more chil- 
dren. In addition, a larger percentage of them used coca as 
medicine. Interestingly, significantly more non-respondents 
are highly risk averse (73.7%) compared to all the other 
respondents (28.6%). This could indicate that the
‘sensitive question non-respondents’ feared that interviewers
would disclose information to third parties. The setup of the
risk aversion test, which follows Binswanger (1980), is
presented in Appendix 1C.

Basic comparative descriptive statistics of coca and non-coca
growers are presented in Table 3. The number of valid
questionnaires was 477 after excluding the non-respondents
to the sensitive question. Of these, 64% indicated
that they are coca growers.

There are no statistically significant differences with 
respect to general socio-economic characteristics (age, sex, 
ethnic group, and number of children) between coca and
non-coca growers. The only difference was observed in 
education. Non-coca growers have more years of schooling 
than coca growers. Coca growers have less total and 
primary forest areas, and more fallow land than non coca 
growers, although these differences are not statistically 
significant. Coca and non-coca growers have similar coffee 
and staple food areas. In contrast, coca growers and
non-coca growers show statistically significant differences
in the social capital variables. More non-coca growers than
coca growers find it important to obey national law. On the
other hand, fewer non-coca growers than coca growers have
experienced a negative change in trust towards their 
neighbors during the last five years, and have worked in 
community activities during the last year. 

There is a statistically significant relationship between
coca growing and traditional uses. A higher percentage of
coca growers than non-coca growers chew coca and use
coca as medicine. More importantly, more coca growers
find it easier to sell coca leaves than non-coca growers in the 
hypothetical case that they would cultivate coca for 
commercial purposes. 


Table 3
Comparative descriptive statistics between coca and non-coca growers

Variable                                           Coca Growers   Non-Coca Growers
Age                                                42.5           41.7
                                                   (12.7)         (12.5)
Male (%)                                           93.9           94.9
Aymara (%)                                         81.4           82.5
Number of Children                                 3.0            2.9
                                                   (2.0)          (2.1)
Years of schooling                                 8.2*           [illegible]
                                                   (3.3)          (3.3)
Total area (ha)                                    7.9            8.0
                                                   (8.4)          (7.8)
Coffee area (ha)                                   [illegible]    [illegible]
                                                   (2.0)          (1.4)
Area secondary forest (fallow area)                1.6            1.4
                                                   (2.4)          (2.1)
Primary forest area (ha)                           3.9            4.2
                                                   (7.5)          (7.0)
Staple food area (ha)                              0.5            0.5
                                                   (0.7)          (0.6)
No other economic activities (%)                   46.8           48.9
High risk aversion (%)                             30.5           25.3
Important to obey national laws (%)                81.9**         88.6**
Negative change in trust in the last 5 years (%)   19.3**         12.5**
Have worked in community activities in 2007 (%)    92.0**         84.7**
Farmer chews coca (%)                              76.0***        53.1***
Farmer uses coca as medicine (%)                   81.7***        54.8***
Perception that it is easy to sell coca leaves (%) 26.4**         18.5**
Number of coca bushes                              3,093          [illegible]
                                                   (6,710)
Number of Observations                             305            172

Standard deviations are in parentheses for continuous variables.
Coca Growers and Non-Coca Growers means are statistically different (t-test with unequal variances) at:
* 0.1 significance level, ** 0.05 significance level, *** 0.01 significance level.


Source: Own calculations. 
Finally, it is important to mention that the average number
of coca bushes is relatively low, which could be due to 
underreporting of commercial coca growing areas or to coca 
cultivation only for self-consumption, or both. It is not 
possible to distinguish between those two scenarios, which 
makes it easier for commercial coca growers to disguise 
themselves as coca growers who produce for traditional uses. 


3.2 Validation issues 


The validity of individual responses cannot be verified 
directly because there is little prior empirical research on 
this topic, and there is an absence of other sources of 
confirming data. However, it is possible to provide a rough
comparison between the survey data and the total area of
coca production reported by international organizations for
the upper Tambopata valley using satellite data. The United
Nations Office on Drugs and Crime (UNODC 2009)
indicates that 940 hectares of coca were cultivated in the
upper Tambopata valley in 2008. The conventional coca
cultivation density for regions with traditional coca growers
could be between 35,000 and 40,000 bushes per hectare
(UNODC 2001). During the 1990s, the coca cultivation
density was lower, between 20,000 and 25,000 bushes per
hectare (UNODC 2009). The coca cultivation density in this
particular valley is relatively low because coca growers
intercrop coca with coffee and staples, although the yields
per bush have increased in recent years (UNODC
2009). Applying the density of 35,000 to 40,000 bushes per
hectare to the 940 hectares, the total number of coca
bushes expected for this valley is approximately 32.9 to
37.6 million.

Our sample of 477 respondents (excluding farmers who 
did not report their co-operative registration number and 
non respondents to the sensitive question) reported a total of 
960,000 coca bushes. This sample corresponds to 14.6% of 
a total of 3,265 co-operative members in SPPP. Thus, 
extrapolating for the total number of co-operative members 
located in the SPPP district would result in a total of 6.6 
million coca bushes. In addition, we need to consider that 
the upper Tambopata valley also includes San Juan del Oro 
district which has around the same population as SPPP 
district (INEI 2007). Under the very strong assumption that 
farmers in SPPP behave similarly to the farmers in San Juan 
del Oro - at least in terms of coca cultivation - this would 
double the number of coca bushes for the entire upper 
Tambopata valley to around 13.2 million. This last estimate 
is between 35 and 40% of the 32.9 to 37.6 million obtained 
from UNODC satellite data. This result is in the expected 
range of reporting on sensitive issues. For reporting on 
abortion, this range is between 35 to 59% (Fu, Darroch, 
Henshaw and Kolb 1998), and for the use of opiates or 
cocaine between 30 to 70% (Tourangeau and Yan 2007). 
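The arithmetic behind this comparison can be retraced as follows. The sketch uses only the figures quoted in the text; the variable names are ours, and the doubling step carries the same strong assumption about San Juan del Oro that the text states.

```python
# Figures quoted in the text
hectares = 940                              # UNODC satellite estimate for 2008
density_low, density_high = 35_000, 40_000  # bushes per hectare
reported_bushes = 960_000                   # total reported by the 477 respondents
sample_size, members = 477, 3_265

# Satellite-based range: 32.9 to 37.6 million bushes
satellite_low = hectares * density_low
satellite_high = hectares * density_high

# Extrapolation from the sample to all co-operative members in SPPP
sampling_fraction = sample_size / members              # about 0.146
district_total = reported_bushes / sampling_fraction   # about 6.6 million

# Assumption from the text: San Juan del Oro behaves like SPPP
valley_total = 2 * district_total                      # about 13.1 million unrounded

share_low = valley_total / satellite_high    # about 0.35
share_high = valley_total / satellite_low    # about 0.40
print(f"{district_total / 1e6:.1f} M, {valley_total / 1e6:.1f} M, "
      f"{share_low:.0%} to {share_high:.0%}")
```

The text reports 13.2 million because it doubles the rounded figure of 6.6 million; the unrounded calculation gives about 13.1 million, and the reported share of 35 to 40% is unaffected.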
4. Summary and conclusions


Coca, a raw material for the production of cocaine, is 
cultivated in Colombia, Peru and Bolivia. In the latter two 
countries, traditional uses of coca by indigenous populations 
date back to around 3000 B.C. (Rivera et al. 2005).
Nevertheless, asking farmers about the extent of their coca 
cultivation areas is considered a sensitive question. Coca 
growers are afraid of eradication programs even if they do 
not sell coca to the narcotics traffic business because it is 
difficult to distinguish between coca growers whose produc- 
tion is commercially oriented and those who produce only 
for self-consumption. Thus, farmers tend not to participate 
in surveys, not to answer any sensitive questions, or to 
underreport their coca cultivation areas in an attempt to 
minimize their identification for possible eradication. 

Against this background, household-level data collection 
procedures need to consider and evaluate strategies to 
reduce nonresponses and misreporting. Most of the strate- 
gies used in our research area in Peru were based on best 
practices reported in the literature review. Some of the 
strategies that worked in our case were establishment of 
trust with the farmers using a presentation letter from a 
coffee co-operative director, confidentiality assurance at the 
beginning and in the middle of the questionnaire, matching 
of interviewer-respondent ethnic background characteristics, 
training of interviewers to reduce their hesitance to ask 
sensitive questions, changing the format of the sensitive 
question to a familiar and forgiving wording, and
non-enforcement of absolute privacy to prevent each farmer
from feeling that they were the “only one” who was 
interviewed. 

The validity of farmers’ individual responses on their 
coca area extensions cannot be checked because the topic 
has produced little prior empirical research, and there is an 
absence of other sources of household-level confirming 
data. Thus, the extent of misreporting was evaluated using 
aggregate data. The results suggest that farmers reported
only between 35 and 40% of their actual coca areas. Still,
those values are within the range of what could be
expected for answers to sensitive questions. In terms of
survey nonresponse and sensitive question nonresponse,
the results were more encouraging, with values of 10%
and around 4%, respectively.

When conducting the survey, we mainly took advantage 
of celebrations and co-operative General Assemblies for 
which farmers congregated in town, since farmers are 
otherwise highly dispersed in the rainforest. The survey 
followed a convenience sampling method but it was pos- 
sible to test the representativeness of this sample because all 
of the farmers are registered in one of the co-operatives in 
the research area. The obtained sample was compared with 
a simulated simple random sample without replacement 
where each farmer had the same probability of being selected
by chance from the co-operative member lists. There were 
no statistical differences in the distribution functions, so the 
sample is equivalent to a simple random one. The main 
drawback of this approach is that after the interview, we 
needed to ask the respondents for their co-operative member 
number. Even though the respondents were told that the co- 
operative identification number was not attached to their 
questionnaires, some farmers might have had doubts about
it, and word of mouth could have undermined the credibility
of the confidentiality assurance in subsequent interviews.

On the other hand, comparing the characteristics of non- 
respondents to sensitive questions with the rest of re- 
spondents indicates that non-respondents were highly risk 
averse. Even though the number of non-respondents was 
small (less than 4% of the total sample), this could suggest 
that the main reason for item non-reporting is the fear of the 
consequences of the information leaking to third parties. 
The coca areas reported by the farmers were on average
very small. This could be an attempt by commercial coca 
growers to appear to be cultivating only for self-con- 
sumption. Coca growing for traditional uses does not have a 
negative connotation per se given that it is a symbol of 
ethnicity and the indigenous population’s struggle for self- 
determination (Office of Technology Assessment 1993). It 
is not possible to distinguish farmers who underreported the 
extent of their coca cultivation areas from those who grow 
coca for self-consumption. Unfortunately, commercial coca 
growers can take advantage of this situation to continue 
growing coca under the guise of traditional uses. 
Appendix 1


Relevant parts of the questionnaire 


A) Presentation: 


Good morning/afternoon/night. My name is ____. I am a student at ____. We are conducting a survey to identify the risks and
vulnerabilities of coffee producers in your community. The coffee co-operative directors are aware of this survey and believe that the result could benefit
the community. If you decide to answer our questionnaire, you may skip any questions or withdraw from this study at any time. The data collected in this 
survey will remain CONFIDENTIAL and will be used only for ACADEMIC purposes. Your answers and opinions are extremely important for the co- 
operative and us. Would you be prepared to respond to some questions? 


a) Yes (proceed) 
b) No (thank the respondent, withdraw the survey, and indicate the characteristics of the person in format 1) 


B) Coca Related Questions: 
In this part, we will ask about coca uses and cultivation. Please, remember that this survey is anonymous and that there are no correct or incorrect answers. 


Do you chew coca leaves? a) Yes b) No 
Do you use coca leaves as medicine? a) Yes b) No 
Do you feel obligated to offer coca leaves to your guests during ayni and minka activities? a) Yes b) No 
Do you use coca leaves for rituals? a) Yes b) No 
Do you use coca leaves for payment to external workers ? a) Yes b) No 
Do you use coca leaves as product exchange or as a gift for friends and relatives? a) Yes b) No 


How many little bushes of coca do you have in your agricultural plot? 


C) Risk Aversion Question: 

This is a game. Before playing it, you need to choose one of the options displayed below. Then I toss a coin. If for example you have chosen option H, and I 
toss the coin and it is heads, you do not win any money at all; but if it is tails, you win S/.200. On the other hand, if you have chosen option A, you receive 
S/.50 regardless of if the tossed coin is heads or tails. Which option from all of the above would you choose before I toss the coin? 


OPTION If it is heads, you win: If it is tails, you win: 
A 50 soles 50 soles 
B 45 soles 95 soles 
C 40 soles 120 soles
D 35 soles 125 soles 
E 30 soles 150 soles 
F 20 soles 160 soles 
G 10 soles 190 soles 
H 0 soles 200 soles 
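To make the structure of the game explicit, the following sketch computes the expected payoff and the spread of each option. It assumes a fair coin and reads the third row of the table as option C (40 soles on heads, 120 on tails); the threshold used to classify a respondent as highly risk averse is not given in this appendix and is not assumed here.

```python
# (heads, tails) payoffs in soles for each option of the game
options = {
    "A": (50, 50), "B": (45, 95), "C": (40, 120), "D": (35, 125),
    "E": (30, 150), "F": (20, 160), "G": (10, 190), "H": (0, 200),
}

expected = {}
spread = {}
for label, (heads, tails) in options.items():
    expected[label] = 0.5 * heads + 0.5 * tails   # fair coin assumed
    spread[label] = tails - heads                 # gap between the two outcomes
    print(f"{label}: expected {expected[label]:.0f} soles, spread {spread[label]} soles")
```

Moving from A to H, the expected payoff never decreases while the spread grows, so choosing an option near A trades expected income for certainty.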
Appendix 2


Comparative descriptive statistics between all observations and sensitive question non respondents 


Variable                                           All Observations (a)      Sensitive Question Non-Respondents
Age                                                42.2                      45.9
                                                   (12.6)                    (9.9)
Male (%)                                           94.3***                   100***
Aymara (%)                                         81.8**                    94.7**
Number of Children                                 [illegible]               [illegible]
                                                   (2.0)                     (2.0)
Years of schooling                                 8.4                       [illegible]
                                                   (3.3)                     (2.9)
Total area (ha)                                    7.9                       6.8
                                                   (8.3)                     (3.2)
Coffee area (ha)                                   [illegible]               [illegible]
                                                   (1.8)                     (1.2)
Area secondary forest (fallow area)                1.6                       1.4
                                                   (2.3)                     (1.1)
Primary forest area (ha)                           4.0                       2.9
                                                   (7.3)                     (3.3)
Staple food area (ha)                              0.5                       0.6
                                                   (0.7)                     (0.6)
No other economic activities (%)                   47.5                      57.9
High risk aversion (%)                             28.6 (marker illegible)   73.7 (marker illegible)
Important to obey national laws (%)                84.3                      89.5
Negative change in trust in the last 5 years (%)   16.8                      26.3
Have worked in community activities in 2007 (%)    89.4                      89.5
Farmer chews coca (%)                              [illegible]               [illegible]
Farmer uses coca as medicine (%)                   72.0*                     84.2*
Easy to sell coca leaves (%)                       23.6                      27.8
Number of Observations                             477                       19


Standard deviations are in parentheses for continuous variables. 
a) All observations without sensitive question non respondents. 


Non respondent means are statistically different from the entire sample (T-test with unequal variances) at: 
* 0.1 significance level, ** 0.05 significance level, *** 0.01 significance level. 


Source: Own calculations. 
Imputation for nonmonotone nonresponse
in the survey of industrial research and development 


Jun Shao, Martin Klein and Jing Xu


Abstract 


Nonresponse in longitudinal studies often occurs in a nonmonotone pattern. In the Survey of Industrial Research and 
Development (SIRD), it is reasonable to assume that the nonresponse mechanism is past-value-dependent in the sense that 
the response propensity of a study variable at time point $t$ depends on response status and observed or missing values of the
same variable at time points prior to $t$. Since this nonresponse is nonignorable, the parametric likelihood approach is
sensitive to the specification of parametric models on both the joint distribution of variables at different time points and the 
nonresponse mechanism. The nonmonotone nonresponse also limits the application of inverse propensity weighting 
methods. By discarding all observed data from a subject after its first missing value, one can create a dataset with a 
monotone ignorable nonresponse and then apply established methods for ignorable nonresponse. However, discarding 
observed data is not desirable and it may result in inefficient estimators when many observed data are discarded. We 
propose to impute nonrespondents through regression under imputation models carefully created under the past-value- 
dependent nonresponse mechanism. This method does not require any parametric model on the joint distribution of the 
variables across time points or the nonresponse mechanism. Performance of the estimated means based on the proposed 
imputation method is investigated through some simulation studies and empirical analysis of the SIRD data. 


Key Words: Bootstrap; Imputation model; Kernel regression; Missing not at random; Longitudinal study; Past-value-dependent.


1. Introduction 


Longitudinal studies, in which data are collected from 
every sampled subject at multiple time points, are very 
common in research areas such as medicine, population 
health, economics, social sciences, and sample surveys.
Statistical analysis in a sample survey typically aims to
estimate or make inference on the mean of a study variable 
at each time point. Nonresponse or missing data in the study 
variable is a serious impediment to performing a valid 
statistical analysis, because the response propensity (PSI) 
may directly or indirectly depend on the value of the study 
variable. Nonresponse is monotone if, whenever a value is
missing at a time point $t$, all future values at $s > t$ are
missing. We focus on nonmonotone nonresponse, which
often occurs in longitudinal surveys. In the Survey of Indus- 
trial Research and Development (SIRD) conducted jointly 
by the U.S. Census Bureau and the U.S. National Science 
Foundation (NSF), for example, a business may be a 
nonrespondent on research and development expenditures at
year $t - 1$ but a respondent at year $t$. For ease we refer to
SIRD in the present tense throughout, but we note that as of 
2008, it has been replaced by the Business R&D and 
Innovation Survey. 

Some existing methods for handling nonmonotone non- 
response can be briefly described as follows. The parametric 
approach assumes parametric models for both the PSI and 


the joint distribution of the study variable across time points 
(e.g., Troxel, Harrington and Lipsitz 1998, Troxel, Lipsitz 
and Harrington 1998). The validity of the parametric ap- 
proach, however, depends on whether parametric models 
are correctly specified. Vansteelandt, Rotnitzky and Robins 
(2007) proposed some methods under some models of the 
PSI at time $t$ conditional on observed past data. Xu, Shao,
Palta and Wang (2008) derived an imputation procedure
under the assumptions that (i) the PSI at $t$ depends only on
values of the study variable at time $t - 1$ and (ii) the study
variables over different time points form a Markov chain.
Another approach, which will be referred to as censoring, is 
to create a dataset with “monotone nonresponse” by dis- 
carding all observed values of the study variable from a 
sampled subject after its first missing value. Methods ap- 
propriate for monotone nonresponse (e.g., Diggle and 
Kenward 1994, Robins and Rotnitzky 1995, Paik 1997) can 
then be applied to the reduced dataset. This approach may 
be inefficient when many observed data are discarded. 
Furthermore, in practical applications it is not desirable to 
throw away observed data. 

The purpose of this article is to propose an imputation 
method for longitudinal data with nonmonotone nonresponse
under the past-value-dependent PSI assumption described
by Little (1995): at a time point $t$, the nonresponse
propensity depends on values of the study variable at time
points prior to $t$. This assumption on the PSI is weaker than
that in Xu et al. (2008) and is different from those in
Vansteelandt et al. (2007). We consider imputation which
does not require building a model for the PSI. Imputation is 
commonly used to compensate for missing values in survey 
problems (Kalton and Kasprzyk 1986). Once all missing 
values are imputed, estimates of parameters are computed 
using the estimated means for complete data by treating 
imputed values as observations. The proposed imputation 
and estimation methodology, including a bootstrap method 
for variance estimation, is introduced in Section 2. To 
examine the finite sample performance of the proposed 
method, we present some simulation results in Section 3. 
We also include an application of the proposed method to 
the SIRD. The last section contains some concluding 
remarks. 


2. Methodology 


We consider the model-assisted approach for survey data 
sampled from a finite population P. We assume that the 
population P is divided into a fixed number of imputation 
classes, which are typically unions of some strata. Within 
each imputation class, the study variable from a population
unit follows a superpopulation. Let $y_t$ be the study variable
at time point $t$, $t = 1, \ldots, T$, $y = (y_1, \ldots, y_T)$, $\delta_t$ be the
indicator of whether $y_t$ is observed, and $\delta = (\delta_1, \ldots, \delta_T)$. Since
imputation is carried out independently within each imputa- 
tion class, for simplicity of notation we assume in this sec- 
tion that there is only a single imputation class. 

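Throughout this paper, we consider nonmonotone nonresponse
and assume that there is no nonresponse at baseline
$t = 1$. The PSI is past-value-dependent if

$$
\begin{aligned}
& P(\delta_t = 1 \mid y_1, \ldots, y_T, \delta_1, \ldots, \delta_{t-1}) \\
& \quad = P(\delta_t = 1 \mid y_1, \ldots, y_{t-1}, \delta_1, \ldots, \delta_{t-1}), \quad t = 2, \ldots, T,
\end{aligned}
\qquad (1)
$$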


where $P$ is with respect to the superpopulation. When nonresponse
is monotone, the past-value-dependent PSI becomes
ignorable (Little and Rubin 2002), since we either
observe all past values or know with certainty that $y_t$ is
missing if it is missing at $t - 1$, and an imputation method
using linear regression proposed by Paik (1997) can be
used. When nonresponse is nonmonotone, however, the
past-value-dependent PSI is nonignorable because the
response indicator at time $t$ is statistically dependent upon
previous values of the study variable, some of which may
not be observed. In this case Paik’s method does not apply.


2.1 Imputation for subjects whose first missing is at t


Let $t > 1$ be a fixed time point and $r + 1$ be the time
point at which the first missing value of $y$ occurs. When
$r + 1 = t$, i.e., for a subject whose first missing value is at $t$,
our proposed imputation procedure is the same as that for
the case of monotone nonresponse (Paik 1997). However,
we still need to provide a justification since we have a
different PSI. It is shown in the Appendix that, under
assumption (1),

$$
\begin{aligned}
& E(y_t \mid y_1, \ldots, y_{t-1}, \delta_1 = \cdots = \delta_{t-1} = 1, \delta_t = 0) \\
& \quad = E(y_t \mid y_1, \ldots, y_{t-1}, \delta_1 = \cdots = \delta_{t-1} = 1, \delta_t = 1), \quad t = 2, \ldots, T,
\end{aligned}
\qquad (2)
$$

where $E$ is the expectation with respect to the superpopulation.
Denote the quantity on the first line of (2) by
$\phi_{t,t-1}(y_1, \ldots, y_{t-1})$, which is the conditional expectation of a
missing $y_t$ given observed $y_1, \ldots, y_{t-1}$. If $\phi_{t,t-1}$ is known,
then a natural imputed value for $y_t$ is $\phi_{t,t-1}(y_1, \ldots, y_{t-1})$.
However, $\phi_{t,t-1}$ is usually unknown. Since $\phi_{t,t-1}$ cannot be
estimated by regressing $y_t$ on $y_1, \ldots, y_{t-1}$ based on data
from subjects with missing $y_t$ values, we need to use (2),
i.e., the fact that $\phi_{t,t-1}$ is the same as the quantity on the
second line of (2), which is the conditional expectation of an
observed $y_t$ given observed $y_1, \ldots, y_{t-1}$ and can be estimated
by regressing $y_t$ on $y_1, \ldots, y_{t-1}$ using data from all
subjects having observed $y_t$ and observed $y_1, \ldots, y_{t-1}$. Note
that (2) is a counterpart of (5) in Xu et al. (2008) under the
last-value-dependent assumption, which is stronger than the
past-value-dependent assumption (1). Under a stronger
assumption, we are able to utilize more data in regression
fitting.

Suppose that a sample $S$ is selected from $P$ according
to a given probability sampling plan. For each $i \in S$,
$\delta_i = (\delta_{i1}, \ldots, \delta_{iT})$ is observed, the study variable $y_{it}$ with
$\delta_{it} = 1$ is observed, and $y_{it}$ with $\delta_{it} = 0$ is not observed,
$t = 1, \ldots, T$. With respect to the superpopulation, $(y_i, \delta_i)$
has the same distribution as $(y, \delta)$ and the $(y_i, \delta_i)$’s are
independent, where $y_i = (y_{i1}, \ldots, y_{iT})$. For $t = 2, \ldots, T$,
let $\hat{\phi}_{t,t-1}$ be the regression estimator of $\phi_{t,t-1}$ based on
observations with $\delta_{i1} = \cdots = \delta_{i(t-1)} = \delta_{it} = 1$. A missing $y_{it}$
with observed $y_{i1}, \ldots, y_{i(t-1)}$ is then imputed by
$\tilde{y}_{it} = \hat{\phi}_{t,t-1}(y_{i1}, \ldots, y_{i(t-1)})$.


To illustrate, we consider the case of $t = 3$ or 4. The
horizontal direction in Table 1 corresponds to time points
and the vertical direction corresponds to different missing
patterns, where each pattern is represented by a vector of 0’s
and 1’s with 0 indicating a missing value and 1 indicating an
observed value. For $t = 3$ and $r = 2$, as the first of the two
steps, we consider missing data at time 3 with first missing
at time 3, i.e., pattern (1,1,0). According to imputation
model (2), we fit a regression using data in pattern (1,1,1)
indicated by + (used as predictors) and × (used as
responses). Then, imputed values (indicated by ○) are
obtained from the fitted regression using data indicated by
* as predictors. For $t = 4$ and $r = 3$, imputation in
pattern (1,1,1,0) can be similarly done using data in pattern
(1,1,1,1) for regression fitting.
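The first step can be sketched for the simplest case, $t = 2$, where $y_1$ is the only predictor. This is a minimal illustration under assumptions of our own: the records are made-up toy values, the regression is an unweighted ordinary least squares fit (a model-assisted survey application would bring in the design weights), and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy records (y1, y2); None marks a missing y2. Baseline y1 is always observed.
records = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1), (4.0, 8.0), (2.5, None), (3.5, None)]

# Fit on respondents at t = 2 (pattern (1,1)), as justified by (2).
respondents = [(y1, y2) for y1, y2 in records if y2 is not None]
a, b = fit_line([y1 for y1, _ in respondents], [y2 for _, y2 in respondents])

# Impute for pattern (1,0) from the fitted line, using observed y1 as predictor.
completed = [(y1, y2 if y2 is not None else a + b * y1) for y1, y2 in records]
print(completed)
```

For $t = 3$ the same idea applies with two predictors, $y_1$ and $y_2$, fitted on pattern (1,1,1) and applied to pattern (1,1,0).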


Survey Methodology, December 2012 


Table 1
Illustration of imputation process

t = 3
             Step 1: r = 2      Step 2: r = 1
             Time               Time
Pattern      1    2    3        1    2    3
(1,0,0)                         *         ○
(1,1,0)      *    *    ○        +         ⊗
(1,1,1)      +    +    ×
(1,0,1)

t = 4
             Step 1: r = 3           Step 2: r = 2           Step 3: r = 1
             Time                    Time                    Time
Pattern      1    2    3    4        1    2    3    4        1    2    3    4
(1,0,0,0)                                                    *              ○
(1,1,0,0)                            *    *         ○        +              ⊗
(1,1,1,0)    *    *    *    ○        +    +         ⊗        +              ⊗
(1,0,1,0)                                                    *              ○
(1,0,0,1)
(1,1,0,1)
(1,0,1,1)
(1,1,1,1)    +    +    +    ×

+ : observed data used in regression fitting as predictors.
× : observed data used in regression fitting as responses.
⊗ : imputed data used in regression fitting as responses.
* : observed data used as predictors in imputation.
○ : imputed values.

What type of regression can we fit to obtain $\hat{\phi}_{t,t-1}$? It is
shown in the Appendix that, if (1) holds and
$E(y_t \mid y_1, \ldots, y_{t-1})$ is linear in $y_1, \ldots, y_{t-1}$ for any $t$ in the
case of no nonresponse, then

$$
E(y_t \mid y_1, \ldots, y_{t-1}, \delta_1 = \cdots = \delta_{t-1} = \delta_t = 1) = E(y_t \mid y_1, \ldots, y_{t-1}) \qquad (3)
$$

and, hence, linear regression under the model-assisted approach
can be used to estimate $\phi_{t,t-1}$. If $E(y_t \mid y_1, \ldots, y_{t-1})$
is not linear, one of the methods described in Section 2.3
can be applied.


2.2 Imputation for subjects whose first missing is at r + 1 < t


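Imputation for a subject whose first missing value is at
time $r + 1 < t$ is more complicated and very different from
that for the case of monotone nonresponse. This is because
when $r + 1 < t$ and nonresponse is monotone,

$$
\begin{aligned}
& E(y_t \mid y_1, \ldots, y_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 0) \\
& \quad = E(y_t \mid y_1, \ldots, y_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 1), \\
& \qquad r = 1, \ldots, t - 2, \; t = 2, \ldots, T,
\end{aligned}
\qquad (4)
$$

whereas (4) does not hold when nonresponse is nonmonotone
(see the proof in the Appendix). Hence, we need
to construct different models for subjects whose first missing
value is at $r + 1 < t$. It is shown in the Appendix that,
when $r + 1 < t$,

$$
\begin{aligned}
& E(y_t \mid y_1, \ldots, y_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 0, \delta_t = 0) \\
& \quad = E(y_t \mid y_1, \ldots, y_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 1, \delta_t = 0), \\
& \qquad r = 1, \ldots, t - 2, \; t = 2, \ldots, T.
\end{aligned}
\qquad (5)
$$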
We now explain how to use (5) to impute missing values at
a fixed time point $t$. Let $\phi_{t,r}(y_1, \ldots, y_r)$ be the quantity on
the first line of (5). If $\phi_{t,r}$ is known, then $y_t$ can be imputed
by $\phi_{t,r}(y_1, \ldots, y_r)$. Otherwise, it needs to be estimated based
on (5). Unlike in model (2) or (4), the conditional expectation
on the second line of (5) is conditional on a missing
$y_t$ ($\delta_t = 0$), although $y_1, \ldots, y_r$ are observed. If we carry
out imputation sequentially according to $r = t - 1, t - 2, \ldots, 1$,
then, for a given $r < t - 1$, the missing $y_t$ values
from subjects whose first missing is at time point $r + 2$ or later
have already been imputed using the method in this section
or Section 2.1. We can fit a regression between imputed $y_t$
and observed $y_1, \ldots, y_r$ using data from all subjects having
already imputed $y_t$ (used as responses), observed $y_1, \ldots, y_r$
(used as predictors), and $\delta_{r+1} = 1$. Once an estimator $\hat{\phi}_{t,r}$ is
obtained, a missing $y_{it}$ with first missing at $r + 1$ is then
imputed by $\tilde{y}_{it} = \hat{\phi}_{t,r}(y_{i1}, \ldots, y_{ir})$.

Consider again the case of $t = 3$ or 4 and Table 1. Following the first step for $t = 3$ discussed in Section 2.1, at the second step, we impute missing values with $r = 1$ in pattern (1,0,0). According to imputation model (5), we fit a regression using data in pattern (1,1,0) indicated by + (used as predictors) and @) (previously imputed values used as responses). Then, imputed values (indicated by ©) are obtained from the fitted regression using data indicated by * as predictors. For $t = 4$, following the first step discussed in Section 2.1, at the second step ($r = 2$) we fit a regression using data in pattern (1,1,1,0) indicated by + (used as predictors) and @) (previously imputed values used as responses). Then, imputed values (indicated by ©) at $t = 4$ in pattern (1,1,0,0) are obtained from the fitted regression using data indicated by * as predictors. At step 3 for $t = 4$, we fit a regression using data in patterns (1,1,0,0) and (1,1,1,0) indicated by + (used as predictors) and @) (previously imputed values used as responses). Then, imputed values (indicated by ©) at $t = 4$ in patterns (1,0,0,0) and (1,0,1,0) are obtained from the fitted regression using data indicated by * as predictors.
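
The sequential steps described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function names `ols_fit`, `predict`, and `impute_at_t` are hypothetical, survey weights are ignored, and the first step ($r = t - 1$) is assumed to regress observed $y_t$ on $y_1, \dots, y_{t-1}$ among subjects with $\delta_1 = \cdots = \delta_t = 1$, which is our reading of Section 2.1. Later steps use previously imputed $y_t$ values as responses and only observed values as predictors.

```python
def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    p = len(rows[0])
    # Augmented system [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def impute_at_t(y, delta, t):
    """Impute y_t (t is 1-based) for every subject with delta_t = 0."""
    n = len(y)
    imputed = {}
    for r in range(t - 1, 0, -1):          # r = t-1, t-2, ..., 1
        if r == t - 1:
            # Assumed Section 2.1 step: respondents at all of 1..t.
            train = [i for i in range(n) if all(delta[i][:t])]
            resp = [y[i][t - 1] for i in train]
        else:
            # Model (5): delta_1 = ... = delta_{r+1} = 1, delta_t = 0,
            # with previously imputed y_t as the response.
            train = [i for i in range(n)
                     if all(delta[i][:r + 1]) and delta[i][t - 1] == 0]
            resp = [imputed[i] for i in train]
        beta = ols_fit([y[i][:r] for i in train], resp)
        for i in range(n):
            # Targets: first missing value at r + 1 and y_t missing.
            if all(delta[i][:r]) and delta[i][r] == 0 and delta[i][t - 1] == 0:
                imputed[i] = predict(beta, y[i][:r])
    return imputed


# Toy data, T = 3, with y3 = 1 + 3*y1 exactly; None marks an unobserved value.
y = [[1, 5, 4], [2, 3, 7], [3, 8, 10], [4, 1, 13],
     [5, 2, None], [6, 7, None], [7, 4, None], [8, None, None]]
delta = [[1, 1, 1]] * 4 + [[1, 1, 0]] * 3 + [[1, 0, 0]]
imputed = impute_at_t(y, delta, t=3)
```

In this toy example the subject in pattern (1,0,0) is imputed from a regression fitted to the subjects in pattern (1,1,0), whose responses are themselves imputed values, mirroring the second step for $t = 3$.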

Although at time $t$, imputation has to be carried out sequentially as $r = t-1, \dots, 1$, imputation for different time points can be done in any order. This can be seen from the illustration given by Table 1, where the imputed values at $t = 3$ are not involved in the imputation process at $t = 4$ or vice versa, although some observed data will be repeatedly used in regression fitting. When data come according to time, it is natural to impute nonrespondents in the order $t = 2, \dots, T$.

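Why can we use previously imputed values as responses in the estimation of the regression function $\phi_{t,r}$ when $r < t-1$? For given $t$ and $r < t-1$, a previously imputed value with first missing value at $s + 1 > r + 1$ is an estimator of

$$
\begin{aligned}
\tilde{y}_t &= E(y_t \mid y_1, \dots, y_s, \delta_1 = \cdots = \delta_s = 1, \delta_{s+1} = 0, \delta_t = 0) \\
&= E(y_t \mid y_1, \dots, y_s, \delta_1 = \cdots = \delta_{s+1} = 1, \delta_t = 0).
\end{aligned}
$$

By the property of conditional expectation and (5),

$$
\begin{aligned}
& E[E(y_t \mid y_1, \dots, y_s, \delta_1 = \cdots = \delta_{s+1} = 1, \delta_t = 0) \mid y_1, \dots, y_r, \delta_1 = \cdots = \delta_{r+1} = 1, \delta_t = 0] \\
&\quad = E(y_t \mid y_1, \dots, y_r, \delta_1 = \cdots = \delta_{r+1} = 1, \delta_t = 0) \\
&\quad = E(y_t \mid y_1, \dots, y_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 0, \delta_t = 0).
\end{aligned}
\tag{6}
$$

This means that $\tilde{y}_t$ and $y_t$ have the same conditional expectation, given $y_1, \dots, y_r$, $\delta_1 = \cdots = \delta_{r+1} = 1$, and $\delta_t = 0$. Therefore, using previously imputed values as responses in regression produces a valid estimator of $\phi_{t,r}$. Note that previously imputed values should not be used as predictors in regression, as equation (6) does not hold if some of $y_1, \dots, y_r$ are imputed values.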

Although all observed data at any time $t$ are used for the estimation of $E(y_t)$, some but not all observed data at time $< t$ are utilized in imputation to avoid biases under nonignorable nonresponse. This is different in the ignorable nonresponse case, where typically all past observed data can be used in regression imputation.


2.3 Regression for imputation


The conditional expectations in (5) depend not only on the distribution of $y$, but also on the PSI. Even if $E(y_t \mid y_1, \dots, y_{t-1})$ is linear, conditional expectations in (5) are not necessarily linear, which is different from the case of $r + 1 = t$ considered in Section 2.1. An example is given by result (10) in the Appendix.

When we do not have a suitable parametric model for $\phi_{t,r}$, the nonparametric kernel regression method given in Cheng (1994) may be applied to obtain $\hat{\phi}_{t,r}$. Since the regressor $(y_{i1}, \dots, y_{ir})$ is multivariate when $r \ge 2$, however, kernel regression has a large variability unless the number of sampled subjects in the category defined by $\delta_{i1} = \cdots = \delta_{i(r+1)} = 1$ is very large. This issue is commonly referred to as the curse of dimensionality.
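
To make the variability issue concrete, here is a small sketch of a kernel regression estimate. It uses a generic Nadaraya-Watson estimator with a Gaussian product kernel, which is an assumption on our part and not necessarily the exact estimator of Cheng (1994); the names `nw_estimate` and `local_fraction` and the bandwidth `h` are hypothetical. The second function shows why multivariate regressors are costly: the share of uniformly spread points that fall in a local cube shrinks geometrically with the dimension $r$.

```python
import math


def nw_estimate(X, y, x0, h):
    """Nadaraya-Watson estimate at x0 (Gaussian product kernel, bandwidth h)."""
    weights = [math.exp(-0.5 * sum(((a - b) / h) ** 2 for a, b in zip(x, x0)))
               for x in X]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


def local_fraction(side, r):
    """Share of points uniform on [0,1]^r that fall in a cube of given side."""
    return side ** r


# Two training points placed symmetrically around x0 get equal weight.
estimate = nw_estimate([(0.0,), (2.0,)], [1.0, 3.0], (1.0,), h=1.0)
# A neighbourhood of side 0.2 holds 20% of the data for r = 1
# but only 0.8% for r = 3.
shares = [local_fraction(0.2, r) for r in (1, 2, 3)]
```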

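Thus, we consider the following alternatives under the additional assumption that the dependence of $\delta_t$ on $y_1, \dots, y_{t-1}$ is through a linear combination of $y_1, \dots, y_{t-1}$. That is,

$$
P(\delta_t = 1 \mid y_1, \dots, y_{t-1}, \delta_1, \dots, \delta_{t-1}) = \Psi\Big(\sum_{l=1}^{t-1} \gamma_{t,l}^{\delta_1 \cdots \delta_{t-1}} \, y_l\Big),
\tag{7}
$$

where the $\gamma_{t,l}^{\delta_1 \cdots \delta_{t-1}}$ are unknown parameters depending on $\delta_1, \dots, \delta_{t-1}$ and $\Psi$ is an unknown function with range [0,1]. Under (7), it is shown in the Appendix that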


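$$
\begin{aligned}
& E(y_t \mid z_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 0, \delta_t = 0) \\
&\quad = E(y_t \mid z_r, \delta_1 = \cdots = \delta_r = 1, \delta_{r+1} = 1, \delta_t = 0), \qquad r = 1, \dots, t-2,
\end{aligned}
\tag{8}
$$

where $z_r = \sum_{l=1}^{r} \gamma_{r,l} \, y_l$ and $\gamma_{r,l} = \gamma_{r+1,l}^{\delta_1 \cdots \delta_r}$ with $\delta_1 = \cdots = \delta_r = 1$. Hence, to impute nonrespondents, we can condition on the linear combination $z_r$ and use (8), instead of conditioning on $y_1, \dots, y_r$ and using (5).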

Let $\psi_{t,r}(z_r)$ be the function defined on the second line of (8). Note that $\psi_{t,r}$ is not necessarily the same as $\phi_{t,r}$. If there is a strong linear relationship between $y_t$ and $y_1, \dots, y_r$, then $\psi_{t,r}$ may be approximately linear so that we can fit a linear regression to obtain an estimator $\hat{\psi}_{t,r}$. In theory, this method is biased when $\psi_{t,r}$ is not linear. If $\gamma_r = (\gamma_{r,1}, \dots, \gamma_{r,r})$ is known, then we can apply a one-dimensional kernel regression to obtain an estimator $\hat{\psi}_{t,r}$, using the one-dimensional index $z_r$. Since $\gamma_r$ is unknown, we first need to estimate it by $\hat{\gamma}_r$ and then obtain $\hat{\psi}_{t,r}$ by applying the one-dimensional kernel regression with $\gamma_r$ replaced by $\hat{\gamma}_r$. For example, the sliced inverse regression (Duan and Li 1991) can be applied to obtain $\hat{\gamma}_r$. However, this type of nonparametric method may be inefficient. If there is a strong linear relationship between $y_t$ and $y_1, \dots, y_r$, we may apply linear regression to obtain $\hat{\gamma}_r$. In any case, we use $y_1, \dots, y_r$ with $\delta_1 = \cdots = \delta_{r+1} = 1$ as predictors and imputed $y_t$ values as responses in any type of regression fitting. After $\hat{\psi}_{t,r}$ and $\hat{\gamma}_r = (\hat{\gamma}_{r,1}, \dots, \hat{\gamma}_{r,r})'$ are obtained, a missing $y_{it}$ is imputed by $\tilde{y}_{it} = \hat{\psi}_{t,r}(\hat{\gamma}_{r,1} y_{i1} + \cdots + \hat{\gamma}_{r,r} y_{ir})$.

Survey Methodology, December 2012

We refer to the method of simply applying linear regression as the linear regression imputation method, and the method of applying kernel regression to the index $z_r$ as the one-dimensional index kernel regression imputation method. An advantage of one-dimensional index kernel regression imputation over kernel regression imputation is that only a one-dimensional kernel regression is applied and, thus, it avoids the curse of dimensionality and has smaller variability.
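
A minimal sketch of the one-dimensional index step follows. It assumes that an estimate of the index coefficients is already available (from sliced inverse regression or linear regression, as discussed above) and applies a Gaussian-kernel smoother to the scalar index. The function name `index_kernel_impute`, the bandwidth `h`, and the choice of kernel are illustrative assumptions, not details taken from the paper.

```python
import math


def index_kernel_impute(X_train, responses, gamma_hat, x_new, h):
    """Kernel-smooth the responses against the scalar index z = gamma_hat' x."""
    def index(x):
        return sum(g * v for g, v in zip(gamma_hat, x))

    z0 = index(x_new)
    weights = [math.exp(-0.5 * ((index(x) - z0) / h) ** 2) for x in X_train]
    return sum(w * r for w, r in zip(weights, responses)) / sum(weights)


# Two predictors collapse to one index, so the smoother is one-dimensional.
X_train = [(0.0, 0.0), (1.0, 1.0)]
responses = [0.0, 2.0]          # imputed y_t values used as responses
gamma_hat = (1.0, 1.0)          # hypothetical estimated coefficients
a = index_kernel_impute(X_train, responses, gamma_hat, (1.0, 0.0), h=1.0)
b = index_kernel_impute(X_train, responses, gamma_hat, (0.5, 0.5), h=1.0)
```

Both new points have the same index value, so they receive the same imputed value; this is the sense in which the method conditions on $z_r$ only.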

These methods can also be applied to the case of $r = t - 1$ if $E(y_t \mid y_1, \dots, y_{t-1})$ is not linear.

In theory, estimators such as the estimated means based 
on kernel regression or one-dimensional index kernel re- 
gression imputation are asymptotically unbiased, but they 
may not be better than those based on linear regression 
imputation when the number of sampled subjects in each 
(t, r) category is not very large. The performances of the 
estimated means based on linear regression, kernel regres- 
sion, and one-dimensional index kernel regression imputa- 
tion are examined by simulation in Section 3. 


2.4 Estimation 


We consider the estimation of the finite population total or the mean of $y_t$ at each fixed $t$, which is often the main purpose of a survey study. At any $t$, let $\tilde{y}_t = y_t$ when $\delta_t = 1$ and $\tilde{y}_t$ be the imputed value using one of the methods in Section 2 when $\delta_t = 0$. The finite population total and the mean of $y_t$ can be estimated by

$$
\hat{Y}_t = \sum_{i \in S} w_i \tilde{y}_{it}
\quad \text{and} \quad
\hat{\bar{Y}}_t = \sum_{i \in S} w_i \tilde{y}_{it} \Big/ \sum_{i \in S} w_i,
\tag{9}
$$

respectively, where $w_i$ is the survey weight constructed such that, in the case of no nonresponse, $\hat{Y}_t$ is an unbiased estimator of the finite population total at time $t$ with respect to the probability sampling. The superpopulation mean of $y_t$ can also be estimated by $\hat{\bar{Y}}_t$. Note that $\sum_{i \in S} w_i$ is an unbiased estimator of the finite population size $N$ and, for some simple sampling designs, it is exactly equal to $N$.
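
As a small worked illustration of (9), the weighted total and mean can be computed as below. The function name `weighted_total_and_mean` is hypothetical; `y_tilde` holds observed values for respondents and imputed values for nonrespondents.

```python
def weighted_total_and_mean(weights, y_tilde):
    """Estimated total and mean in (9) from weights and completed data."""
    total = sum(w * y for w, y in zip(weights, y_tilde))
    return total, total / sum(weights)


# Two sampled units with weights 2 and 3 and completed values 10 and 20.
total, mean = weighted_total_and_mean([2.0, 3.0], [10.0, 20.0])
```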

The survey weights should also be used in the regression fitting for imputation. Under the same conditions given in Cheng (1994), $\hat{Y}_t$ or $\hat{\bar{Y}}_t$ based on kernel regression or one-dimensional index kernel regression imputation is consistent and asymptotically normal as the sample size increases to $\infty$. The required conditions and proofs can be found in Xu (2007).

If we apply the linear regression imputation method as discussed in Section 2.3, then the resulting estimated mean at $t$ may be asymptotically biased. This bias is small if the function $\psi_{t,r}$ can be well approximated by a linear function in the range of the data values. On the other hand, kernel or one-dimensional index kernel regression imputation may require a much larger sample size than that for linear regression imputation. Hence, the overall performance of the estimated mean based on linear regression imputation may still be better, as indicated by the simulation results in Section 3.


2.5 Variance estimation 


For assessing statistical accuracy or inference such as constructing a confidence interval for the mean of $y_t$ at $t$, we need variance estimators of $\hat{Y}_t$ or $\hat{\bar{Y}}_t$ based on imputed data. Because of the complexity of the imputation procedure, it is difficult to obtain explicit formulas for the variance of $\hat{Y}_t$ or $\hat{\bar{Y}}_t$. The bootstrap method (Efron 1979) is then considered. A correct bootstrap can be obtained by repeating the process of imputation in each of the bootstrap samples (Shao and Sitter 1996). Let $\hat{\theta}$ be the estimator under consideration. A bootstrap procedure can be carried out as follows.


1. Draw a bootstrap sample as a simple random sample of the same size as $S$ with replacement from the set of sampled subjects.

2. For units in the bootstrap sample, their survey weights, response indicators, and observed data from the original data set are used to form a bootstrap data set. Apply the proposed imputation procedure to the bootstrap data. Calculate the bootstrap analog $\hat{\theta}^{*}$ of $\hat{\theta}$.

3. Independently repeat the previous steps $B$ times to obtain $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$. The sample variance of $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$ is the bootstrap variance estimator for $\hat{\theta}$.


In application, each $\hat{\theta}^{*b}$ can be calculated using the $b$th bootstrap data $(y_i, \delta_i, w_i^{*b})$, $i \in S$, where $w_i^{*b}$ is $w_i$ multiplied by the number of times unit $i$ appears in the $b$th bootstrap sample. Note that the same $w_i^{*b}$ can be used for all variables of interest, not just $y_t$.
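
The weight-multiplicity form of the bootstrap can be sketched as follows. This is an illustration under stated assumptions: `bootstrap_variance` is a hypothetical name, and `estimator` stands for any user-supplied function that reruns the whole imputation procedure on the data with the bootstrap weights and returns the estimate. A plain weighted mean is used as a stand-in in the usage lines, since it involves no imputation.

```python
import random
import statistics


def bootstrap_variance(data, weights, estimator, B=200, seed=0):
    """Bootstrap variance of estimator(data, bootstrap_weights)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        # Number of times each unit appears in a with-replacement sample.
        counts = [0] * n
        for _ in range(n):
            counts[rng.randrange(n)] += 1
        w_star = [w * c for w, c in zip(weights, counts)]
        # The estimator must redo imputation with the bootstrap weights.
        replicates.append(estimator(data, w_star))
    return statistics.variance(replicates)


def weighted_mean(values, w):
    # Stand-in estimator without imputation, for illustration only.
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


w = [2.0] * 10
var_const = bootstrap_variance([5.0] * 10, w, weighted_mean, B=20, seed=1)
var_vary = bootstrap_variance([float(i) for i in range(10)], w,
                              weighted_mean, B=20, seed=1)
```

Units that are not drawn in a replicate simply receive weight zero, which is why the same bootstrap weights can be reused for every variable of interest.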


3. Empirical results 


We study $\hat{Y}_t$ or $\hat{\bar{Y}}_t$ in (9) based on the proposed imputation methods at each time point $t$. We first consider a simulation with a normal population for the $y_t$'s. An application to the SIRD data is presented next. To examine the performance of the proposed methods for the SIRD, a simulation with a population generated using the SIRD data is presented at the end. We have implemented the proposed imputation methods in R (R Development Core Team 2009). To fit the required nonparametric regressions, we use the R function loess with default settings, which fits a local polynomial surface in one or more regressor variables. The required linear regressions are easily fit in R using the function lm. Our implementations of the proposed methods include error checking (such as ensuring that there are sufficient points for regression fitting at each stage), which is particularly important in bootstrap and simulation settings where the imputation methods are replicated many times and each iteration cannot be examined manually. We defaulted to an overall mean imputation in cases where there were not enough data points to fit a regression.
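
The fallback rule can be illustrated with a short sketch. It is written in Python rather than R, uses a single predictor for brevity, and the threshold `min_points`, the function name `fit_line_or_mean`, and the way the overall mean is supplied are all assumptions made for illustration; the paper does not state these details.

```python
def fit_line_or_mean(x, y, overall_mean, min_points=3):
    """Return a prediction function; fall back to the overall mean if the
    regression cannot be fitted reliably (too few points or constant x)."""
    n = len(x)
    if n >= min_points:
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        if sxx > 0.0:
            slope = sum((xi - x_bar) * (yi - y_bar)
                        for xi, yi in zip(x, y)) / sxx
            return lambda x_new: y_bar + slope * (x_new - x_bar)
    return lambda x_new: overall_mean


fitted = fit_line_or_mean([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], overall_mean=10.0)
fallback = fit_line_or_mean([1.0], [2.0], overall_mean=10.0)
```

A check of this kind matters most inside bootstrap or simulation loops, where a single degenerate replicate would otherwise stop the whole run.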


3.1 Simulation results from a normal population 


A simulation study was conducted with normally distributed $y_1, \dots, y_T$, $n = 2{,}000$, and $T = 4$. A single imputation class and simple random sampling with replacement were considered. In the simulation, the $y_i$'s were independently generated from the multivariate normal distribution with mean vector (1.33, 1.94, 2.73, 3.67) and the covariance matrix having the AR(1) structure with correlation coefficient 0.7 and unit variance; all data at $t = 1$ were observed; missing data at $t = 2, 3, 4$ were generated according to


POR Wines Oe Oe 


1-0(0.6(1- Dy, 77") a 


_ f#0-8,)/ 


[A + (1-6, )k] 
k=l 
and $\Phi$ is the standard normal distribution function. The unconditional probabilities of nonresponse patterns are given in Table 2.
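
The data-generating step for the study variable can be sketched as follows. Only the outcome vectors are generated here; the nonresponse mechanism is not reproduced. The sketch uses the fact that a stationary AR(1) recursion with unit variance has covariance $0.7^{|s-t|}$ between time points $s$ and $t$. The names `ar1_cov` and `draw_subject` and the seed are hypothetical.

```python
import random

MEAN = [1.33, 1.94, 2.73, 3.67]
RHO = 0.7


def ar1_cov(T, rho):
    """AR(1) covariance matrix with unit variance: entry (s, t) is rho^|s-t|."""
    return [[rho ** abs(s - t) for t in range(T)] for s in range(T)]


def draw_subject(rng, mean=MEAN, rho=RHO):
    """One draw of (y_1, ..., y_T) from the normal model with AR(1) errors."""
    e = rng.gauss(0.0, 1.0)
    y = [mean[0] + e]
    for t in range(1, len(mean)):
        # The innovation scale keeps the marginal variance equal to 1.
        e = rho * e + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y.append(mean[t] + e)
    return y


rng = random.Random(2012)
sample = [draw_subject(rng) for _ in range(2000)]
cov = ar1_cov(4, RHO)
mean_y1 = sum(row[0] for row in sample) / len(sample)
```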

For comparison, we included a total of nine estimators of the mean of $y_t$: they are sample means based on (1) the complete data (used as the gold standard); (2) respondents with adjusted weights assuming the probability of response is the same within each imputation class; (3) censoring and linear regression imputation, which first discards all observations of a subject after the first missing value to create a dataset with "monotone nonresponse" and then applies linear regression imputation as described in Paik (1997); (4) the proposed kernel regression imputation; (5) the proposed linear regression imputation; (6) the proposed one-dimensional index kernel regression imputation using the sliced inverse regression to obtain $\hat{\gamma}_r$; (7) the kernel regression imputation proposed in Xu et al. (2008) based on the last-value-dependent PSI; (8) the linear regression imputation based on a regression between respondents at time $t$ and observed and imputed values at time points $1, \dots, t-1$ (treating imputed as observed); (9) the linear regression imputation based on a regression between respondents at time $t$ and observed data from units with the same missing pattern at time points $1, \dots, t-1$.

Table 2
Probabilities of nonresponse patterns in the simulation study (Normal population)

              Pattern      Probability
Monotone      (1,0,0,0)    0.062
              (1,1,0,0)    0.043          total = 0.181
              (1,1,1,0)    0.076
Intermittent  (1,0,0,1)    0.113
              (1,0,1,0)    0.071
              (1,0,1,1)    0.186          total = 0.494
              (1,1,0,1)    0.124
Complete      (1,1,1,1)    0.325
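
The censoring step used by method (3) can be sketched in a few lines: every observation after a subject's first missing value is treated as missing, which turns any pattern into a monotone one. The function name `censor` is hypothetical. The probability for pattern (1,0,0,1) is taken as 0.113, which is the value implied by the intermittent total of 0.494 in Table 2.

```python
def censor(delta):
    """Set every indicator after the first missing value to 0."""
    censored, seen_missing = [], False
    for d in delta:
        seen_missing = seen_missing or d == 0
        censored.append(0 if seen_missing else 1)
    return censored


# Intermittent patterns from Table 2 and their probabilities
# (0.113 is implied by the stated total of 0.494).
intermittent = {(1, 0, 0, 1): 0.113, (1, 0, 1, 0): 0.071,
                (1, 0, 1, 1): 0.186, (1, 1, 0, 1): 0.124}
# Share of subjects that lose at least one observed value under censoring.
share_affected = sum(p for pattern, p in intermittent.items()
                     if censor(pattern) != list(pattern))
```

Every intermittent pattern loses data under censoring, so close to half of the simulated subjects are affected.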


Method (2) simply ignores nonrespondents and, hence, is biased and inefficient. Under the PSI assumption (1), methods (7)-(9) are also biased for $t \ge 3$, because method (7) requires the last-value-dependent assumption that is stronger than (1), method (8) treats previously imputed values as observed in regression, and method (9) requires the following condition that is not true under (1):


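$$
\begin{aligned}
& E(y_t \mid y_1, \dots, y_{t-1}, \delta_1 = j_1, \dots, \delta_{t-1} = j_{t-1}, \delta_t = 0) \\
&\quad = E(y_t \mid y_1, \dots, y_{t-1}, \delta_1 = j_1, \dots, \delta_{t-1} = j_{t-1}, \delta_t = 1),
\end{aligned}
$$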


where $(j_1, \dots, j_{t-1})$ is a fixed missing pattern. Finally, as we discussed in Section 2.3, method (5) is also biased for $t \ge 3$ since linear regression is not an exactly correct model. However, methods (5), (8), and (9) may still perform well when the biases are not substantial, because the use of a simpler model and more data in regression for imputation may compensate for the loss in biased imputation. Furthermore, any assumption on the PSI may hold only approximately and it is desired to empirically study various methods in any particular application.

For the case of $r = t - 1$, linear regression imputation is applied as discussed in Section 2.1. Hence, methods (3)-(6), (8)-(9) all give the same results when $t = 2$.

Table 3 reports (based on 1,000 simulation runs) the relative bias and standard deviation (SD) of the mean estimator, the mean of SD_boot, the bootstrap estimator of SD based on 200 bootstrap replications, and the coverage probability of the approximate 95% confidence interval (CI) obtained using point estimator ± 1.96 × SD_boot. The following is a summary of the results in Table 3.


1. The sample mean based on ignoring missing data is clearly biased. Although in the case of $t = 4$ its relative bias is only 3.5%, it still leads to a very low coverage probability of the confidence interval, because the SD of the estimated mean is also very small.

2. The bootstrap estimator of standard deviation performs well in all cases, even when the mean estimator is biased.

3. $\hat{\bar{Y}}_t$ based on censoring and linear regression imputation has negligible bias so that the related confidence interval has a coverage probability close to the nominal level 95%; but it has a large SD when $t = 3$ or $t = 4$. The inefficiency of this method is obviously caused by discarding observed data from nearly 50% of sampled subjects who have intermittent nonresponse. Its performance becomes worse as $t$ increases.

4. $\hat{\bar{Y}}_t$ based on the proposed kernel regression imputation has a relative bias between 0.0% and 0.5%, but the bias is large enough to result in a poor coverage performance of the related confidence interval at $t = 4$.

5. $\hat{\bar{Y}}_t$ based on the proposed linear regression imputation has negligible bias as well as a variance smaller than that of $\hat{\bar{Y}}_t$ based on kernel regression. The related confidence interval has a coverage probability close to the nominal level 95%.

6. $\hat{\bar{Y}}_t$ based on the proposed one-dimensional index kernel regression imputation is generally good but slightly worse than that based on the linear regression imputation.

7. $\hat{\bar{Y}}_t$ based on methods (7)-(9) has non-negligible bias when $t = 3$ or $t = 4$, which results in poor performance of the related confidence interval.


Table 3
Simulation results for mean estimation (Normal population)

Method                                   Quantity        t = 2     t = 3     t = 4
Complete data                            relative bias   0%        0%        0%
                                         SD              0.0221    0.0223    0.0221
                                         SD_boot         0.0223    0.0223    0.0224
                                         CI coverage     94.9%     94.4%     95.4%
Respondents only                         relative bias   12.8%     6.8%      3.5%
                                         SD              0.0282    0.0272    0.0248
                                         SD_boot         0.0285    0.0267    0.0252
                                         CI coverage     0.0%      0.0%      0.2%
Censoring and linear regression          relative bias   0.0%      0.0%      -0.1%
imputation                               SD              0.0275    0.0358    0.0418
                                         SD_boot         0.0276    0.0354    0.0431
                                         CI coverage     95.1%     94.6%     95.6%
Proposed kernel regression               relative bias   0.0%      0.4%      0.5%
imputation                               SD              0.0275    0.0288    0.0283
                                         SD_boot         0.0276    0.0288    0.0288
                                         CI coverage     95.1%     92.5%     88.6%
Proposed linear regression               relative bias   0.0%      0.1%      0.0%
imputation                               SD              0.0275    0.0286    0.0279
                                         SD_boot         0.0276    0.0287    0.0293
                                         CI coverage     95.1%     93.8%     95.7%
Proposed 1-dimensional index             relative bias   0.0%      0.4%      0.4%
kernel regression imputation             SD              0.0275    0.0288    0.0279
                                         SD_boot         0.0276    0.0288    0.0288
                                         CI coverage     95.1%     92.5%     91.7%
Last-value-dependent kernel              relative bias   0.6%      1.0%      0.6%
regression imputation                    SD              0.0284    0.0310    0.0257
                                         SD_boot         0.0288    0.0295    0.0263
                                         CI coverage     93.7%     84.2%     86.2%
Linear regression imputation             relative bias   0.0%      1.6%      0.8%
treating previously imputed              SD              0.0275    0.0261    0.0241
values as observed                       SD_boot         0.0276    0.0260    0.0246
                                         CI coverage     95.1%     59.7%     76.0%
Linear regression imputation             relative bias   0.0%      1.6%      0.8%
based on currently and                   SD              0.0275    0.0261    0.0242
previously observed data                 SD_boot         0.0276    0.0261    0.0246
                                         CI coverage     95.1%     59.0%     76.1%


Although the kernel regression is asymptotically valid, in this simulation study the total number of subjects is 2,000 and, according to Table 2, the average numbers of data points used in kernel regression under patterns $(t, r) = (4,1)$ and $(4,2)$ are 238 and 152, respectively, which may not be enough for kernel regression and lead to some small biases in imputation. On the other hand, linear regression is more stable and works well with a sample size such as 152. Although linear regression imputation has a bias in theory, the bias may be small when $E(y_t \mid y_1, \dots, y_{t-1})$ is linear.
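
The summary quantities reported for each method can be computed from the simulation output along the following lines. This is a sketch under assumptions: relative bias is taken to be (mean of estimates − true value) / true value, which the text does not define explicitly, and the function name `summarize_runs` is hypothetical.

```python
import statistics


def summarize_runs(estimates, sd_boot, truth):
    """Relative bias, SD, mean bootstrap SD, and 95% CI coverage."""
    rel_bias = (statistics.mean(estimates) - truth) / truth
    sd = statistics.stdev(estimates)
    mean_sd_boot = statistics.mean(sd_boot)
    # Interval for each run: estimate +/- 1.96 * bootstrap SD.
    covered = [abs(e - truth) <= 1.96 * s
               for e, s in zip(estimates, sd_boot)]
    return rel_bias, sd, mean_sd_boot, sum(covered) / len(covered)


# Four made-up runs, for illustration only.
rel_bias, sd, mean_sd_boot, coverage = summarize_runs(
    [0.9, 1.1, 1.0, 1.4], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

The coverage computation shows why a small bias can still hurt: when SD_boot is small, even a modest shift of the estimates moves many intervals off the true value.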


3.2 Application to the SIRD 


The SIRD is an annual survey of about 31,000 compa- 
nies potentially involved in research and development. The 
NSF sponsors this survey as part of a mandate requiring that 
NSF collect, interpret, and analyze data on scientific and 
engineering resources in the United States. The survey is 
conducted jointly by the U.S. Census Bureau and NSF. The 
surveyed companies are asked to provide information 
related to their total research and development (RD) 
expenditure for the calendar year of the survey. The SIRD 
deterministically surveys some companies each year by 
placing them in a certainty stratum, since they account for a 
large percentage of the total RD dollar investment in the 
U.S. The remaining companies that appear in the survey are 
sampled each year using a stratified probability propor- 
tionate to size (PPS) sampling design. Longitudinal mea- 
surements are available on the core of companies that are 
sampled with certainty and on other companies that happen 
to be selected each year. For the purposes of illustrating our 
imputation methods, we restrict attention to only those 
companies that were selected for the survey in each of the 
years 2002 through 2005 (7 = 4), and companies that 
provided a response in 2002. For documentation on the SIRD and detailed statistical tables, we refer to the document titled Research and Development in Industry: 2005, available from http://www.nsf.gov/statistics/nsf10319. Additional information on the Business R&D and Innovation Survey is available online at http://bhs.dev.econ.census.gov/bhs/brdis/ and http://www.nsf.gov/statistics/srvyindustry/about/brdis/.

We divide the data into two imputation classes. One class consists of all companies contained in a certainty stratum for each of the four years; the other consists of the rest of the companies. Within each imputation class, the data take the form $(y_i, \delta_i)$, $i = 1, \dots, n$, where $y_{it}$ represents the total RD expenditure for company $i$ at time $t = 1$ (2002), 2 (2003), 3 (2004), 4 (2005). The sample size here is $n = 2{,}309$ for the certainty strata class and $n = 1{,}039$ for the non-certainty strata class. Missingness is nonmonotone and the missing percentages for the years 2003, 2004, and 2005 were 10.4%, 14.0%, and 18.8% for the certainty strata class, and 15.2%, 20.7%, and 26.0% for the non-certainty strata class.

Table 4 shows the estimated totals and standard errors obtained by using the methods (2)-(9) described in the simulation study in Section 3.1. As discussed at the end of Section 2.1, in each of the proposed imputation methods we use linear regression when $r + 1 = t$. The standard errors shown in Table 4 were computed using the bootstrap method. Table 4 also displays estimated totals obtained when missing data are filled in by the values that were put in place by the Census Bureau in order to produce the officially published data tables (officially published data tables are available from http://www.nsf.gov/statistics/pubseri.cfm?seri_id=26). The method that was used by the Census Bureau to handle missing data when producing these published data tables (which we call the "current method") was ratio imputation for companies with prior year data using imputation cells formed by industry type; we refer to Bond (1994) for further details. Table 4 also presents the estimated RD totals obtained from respondents only with no weight adjustment, which indicate that ignoring the missing data leads to biased estimates.

Methods (3)-(9) give comparable results, which is likely due to the strong linear dependence in the data so that theoretically biased methods exhibit negligible bias. The estimated totals based on the current method are comparable to those based on the proposed methods for the certainty strata case, but are different in the non-certainty strata case. The method of censoring and linear regression has a similar SD to the proposed methods because the number of data points discarded under censoring is not too large. In the certainty strata imputation class only 10% of the sample has an intermittent nonresponse pattern and the percentage of complete cases is 72%. In the non-certainty class, only 9% of the sample has an intermittent nonresponse pattern and the percentage of complete cases is 66%.


3.3 Simulation results based on the SIRD population 


An additional simulation study was conducted using a population constructed from the SIRD data. The simulation was run independently for the certainty strata and non-certainty strata imputation classes. To construct the population, we begin with the SIRD data with missing values imputed using the current imputation method for the SIRD. Let $\tilde{\delta}_i$ be the observed response indicator vector for company $i$ and $\tilde{y}_i$ be the vector of either the observed or imputed values of total RD expenditures for company $i$ over time, $i = 1, \dots, n$. For the simulation, we sample from a population based on $\{(\tilde{y}_i, \tilde{\delta}_i), i = 1, \dots, n\}$ as follows. We first draw a sample of size $n$ with replacement from $\tilde{y}_1, \dots, \tilde{y}_n$, then we add independent normal random noise, with mean 0 and standard deviation 500, to each component of each of the sampled vectors. Any resulting negative values are set to zero. We denote these simulated RD totals by $y_1, \dots, y_n$, where $n$ is the same as that in Section 3.2. We denote the simulated response indicators by $\delta_1, \dots, \delta_n$. For all $i$ and each $t = 2, 3, 4$, the $\delta_{it}$'s were binary random variables with

$$
P(\delta_{it} = 1 \mid y_{i1}, \dots, y_{i(t-1)})
= \frac{\exp\big(\beta_0^{(t)} + \beta_1^{(t)} y_{i1} + \cdots + \beta_{t-1}^{(t)} y_{i(t-1)}\big)}
       {1 + \exp\big(\beta_0^{(t)} + \beta_1^{(t)} y_{i1} + \cdots + \beta_{t-1}^{(t)} y_{i(t-1)}\big)}.
$$

The coefficients $\beta_0^{(t)}, \beta_1^{(t)}, \dots, \beta_{t-1}^{(t)}$ are fixed throughout the simulation and they were obtained as the estimated coefficients from an initial fit of a logistic regression of $\tilde{\delta}_{it}$ on $(\tilde{y}_{i1}, \dots, \tilde{y}_{i(t-1)})$ for $i = 1, \dots, n$.
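
The construction of one simulated unit can be sketched as below. The coefficient values in the usage lines are made up for illustration (the fitted logistic coefficients are not reported in the text), and the names `simulate_unit` and `beta_by_t` are hypothetical. The first response indicator is fixed at 1 because the analysis is restricted to companies that responded in 2002.

```python
import math
import random


def simulate_unit(y_base, beta_by_t, rng):
    """Perturb one resampled vector and draw its response indicators."""
    # Add N(0, 500^2) noise to each component; negative values become zero.
    y = [max(0.0, v + rng.gauss(0.0, 500.0)) for v in y_base]
    delta = [1]  # every unit responds at t = 1
    for t in range(2, len(y) + 1):
        b = beta_by_t[t]  # (intercept, slope for y_1, ..., slope for y_{t-1})
        eta = b[0] + sum(bj * yj for bj, yj in zip(b[1:], y[:t - 1]))
        p_respond = 1.0 / (1.0 + math.exp(-eta))
        delta.append(1 if rng.random() < p_respond else 0)
    return y, delta


rng = random.Random(1)
population = [[100.0, 0.0, 2000.0, 50.0], [900.0, 950.0, 1000.0, 1100.0]]
# Hypothetical coefficients: a large intercept makes response almost certain.
beta_by_t = {2: [50.0, 0.0], 3: [50.0, 0.0, 0.0], 4: [50.0, 0.0, 0.0, 0.0]}
y_sim, delta_sim = simulate_unit(rng.choice(population), beta_by_t, rng)
```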

Table 5 reports the simulation results for total estimators based on 1,000 runs and methods (1)-(9) described in Section 3.1, where the quantities appearing in the table are defined in Section 3.1. To compute the relative bias we obtain the true value of the total through a preliminary run of the simulation model. Several of the conclusions from the normal population simulation of Section 3.1 carry over to this setting. The following is a summary of some additional findings.


1. In contrast to the normal population simulation setting, the estimated total based on censoring and linear regression has an SD that is comparable with the proposed imputation methods. This is because the number of data points discarded under censoring is small in this case. The probabilities of an intermittent response pattern are 17% and 19% for the certainty and non-certainty strata classes, respectively. In the normal population simulation these probabilities were nearly 50% as shown in Table 2.

2. All of the proposed imputation methods give relatively similar performance. As noted previously, linear regression imputation is generally biased in theory. However, the bias is small because of the strong linear dependence in the data.

3. Method (7) does not perform well at $t \ge 3$ for the non-certainty strata case, because the last-value-dependent PSI assumption does not hold.

4. Methods (8) and (9) perform well, again due to the strong linear dependence in the data. Although these methods use more observed data in regression imputation, they are comparable with the proposed linear regression method.


Table 4
RD total estimates (in thousands) from SIRD data based on years 2002 to 2005. Bootstrap standard error (in thousands) in parentheses.*

                                                   Certainty strata                 Non-certainty strata
Method                                             t = 2      t = 3      t = 4      t = 2        t = 3        t = 4
Current imputation                                 154,066    156,754    168,015    [illegible]  [illegible]  [illegible]
Respondents only with no weight adjustment         149,502    148,300    159,822    2,448        [illegible]  [illegible]
                                                   (15,907)   (16,160)   (17,149)   (172)        (193)        (207)
Respondents only with adjusted weights             166,924    172,419    196,815    2,887        [illegible]  [illegible]
                                                   (17,728)   (18,720)   (21,045)   (199)        [illegible]  [illegible]
Censoring and linear regression imputation         154,824    159,206    172,631    2,843        [illegible]  [illegible]
                                                   (15,888)   (16,394)   (17,470)   (189)        (208)        (246)
Proposed kernel regression imputation              154,824    159,394    171,633    2,843        [illegible]  [illegible]
                                                   (15,888)   (16,414)   (17,603)   (189)        (199)        (290)
Proposed linear regression imputation              154,824    159,198    172,042    2,843        [illegible]  [illegible]
                                                   (15,888)   (16,383)   (17,247)   (189)        (203)        (250)
Proposed 1-dimensional index kernel regression     154,824    159,394    171,494    2,843        [illegible]  [illegible]
imputation                                         (15,888)   (16,414)   (17,268)   (189)        (199)        (248)
Last-value-dependent kernel regression imputation  154,688    158,768    170,606    [illegible]  [illegible]  [illegible]
                                                   (15,900)   (16,286)   (17,234)   (188)        (197)        (240)
Linear regression imputation treating previously   154,824    159,401    172,600    2,843        [illegible]  [illegible]
imputed values as observed                         (15,888)   (16,390)   (17,306)   (189)        (208)        (236)
Linear regression imputation based on currently    154,824    160,205    172,452    2,843        [illegible]  [illegible]
and previously observed data                       (15,888)   (16,534)   (17,209)   (189)        (233)        (254)

* Disclaimer: The values in Table 4 do not necessarily represent national estimates because we have made some restrictions on the data to fit our framework.


Table 5
Simulation results for total estimation (in thousands), SIRD-based population

                                                  Certainty strata                 Non-certainty strata
Method                           Quantity         t = 2        t = 3     t = 4        t = 2    t = 3    t = 4
Complete data                    relative bias    0%           0.1%      0.1%         0.2%     0.0%     0.4%
                                 SD               15,541       16,045    16,947       184      203      224
                                 SD_boot          15,654       15,994    16,941       186      201      218
                                 CI coverage      94.0%        94.0%     94.3%        94.3%    93.7%    93.9%
Respondents only with            relative bias    5%           6.3%      11.6%        -1.1%    1.1%     -2.7%
adjusted weights                 SD               16,870       17,858    20,032       191      220      244
                                 SD_boot          16,917       17,915    20,048       192      219      234
                                 CI coverage      94.8%        94.8%     87.3%        93.2%    94.5%    89.8%
Censoring and linear             relative bias    0%           0.4%      0.5%         0.4%     0.1%     -0.4%
regression imputation            SD               15,582       16,272    17,247       191      214      238
                                 SD_boot          15,654       16,145    17,195       194      214      236
                                 CI coverage      93.8%        93.5%     94.2%        94.8%    94.0%    93.7%
Proposed kernel regression       relative bias    0%           0.2%      -0.1%        0.4%     -0.3%    -0.3%
imputation                       SD               15,582       16,130    17,098       191      205      246
                                 SD_boot          15,654       16,072    [illegible]  194      204      262
                                 CI coverage      93.8%        93.5%     94.2%        94.8%    93.4%    93.7%
Proposed linear regression       relative bias    0%           0.2%      0.0%         0.4%     0.0%     -0.5%
imputation                       SD               15,582       16,130    16,955       191      206      229
                                 SD_boot          15,654       16,072    16,964       194      206      224
                                 CI coverage      93.8%        93.5%     94.2%        94.8%    94.0%    93.7%
Proposed 1-dimensional           relative bias    0%           0.2%      -0.1%        0.4%     -0.3%    -0.9%
index kernel regression          SD               15,582       16,130    16,957       191      205      [illegible]
imputation                       SD_boot          15,654       16,072    16,965       194      204      220
                                 CI coverage      93.8%        93.5%     94.3%        94.8%    93.4%    93.1%
Last-value-dependent             relative bias    0%           0.1%      -0.3%        0.0%     -0.7%    -0.7%
kernel regression                SD               [illegible]  16,019    16,990       184      204      242
imputation                       SD_boot          15,635       16,003    16,983       187      202      230
                                 CI coverage      93.8%        93.7%     94.0%        93.9%    92.7%    91.1%
Linear regression                relative bias    0%           0.2%      0.0%         0.4%     0.6%     -0.6%
imputation treating              SD               15,582       16,120    16,952       191      210      [illegible]
previously imputed values        SD_boot          15,654       16,065    16,954       194      210      225
as observed                      CI coverage      93.8%        93.6%     94.3%        94.8%    93.8%    92.8%
Linear regression                relative bias    0%           0.2%      0.0%         0.4%     0.6%     -0.6%
imputation based on              SD               15,582       16,117    16,945       191      213      241
currently and previously         SD_boot          15,654       16,062    16,954       194      211      254
observed data                    CI coverage      93.8%        93.5%     94.3%        94.8%    93.6%    93.7%


4. Concluding remarks 


We consider a longitudinal study variable having nonmonotone nonresponse. Under the assumption that the PSI depends on past observed or unobserved values of the study variable, we propose several imputation methods that lead to unbiased or nearly unbiased estimators of the total or mean of the study variable at a given time point. Our methods do not require any parametric model on the joint distribution of the variables across time points or the PSI. They are based on regression models under different nonresponse patterns derived from the past-data-dependent PSI. Three regression methods are adopted: linear regression, kernel regression, and one-dimensional index kernel regression. The imputation method based on the kernel type regression is asymptotically valid, but it requires a large number of observations in each nonresponse pattern. The imputation method based on linear regression is asymptotically biased when the linear relationship does not hold, but it is more stable and, therefore, it may still outperform methods based on kernel regression.

The method of censoring, which discards all observed data from a subject after its first missing value, may work well when the number of data discarded is small; otherwise it may be very inefficient, especially when $T$ is large. For the SIRD data analysis in Sections 3.2-3.3, censoring is comparable with the proposed linear regression imputation method. However, the results are based on four years of data only and censoring may lead to inefficient estimators when more years of data are considered. In applications, it may be a good idea to compare estimators based on censoring with those based on the proposed methods.

Estimators based on the linear regression imputation 
methods (8) and (9) described in Section 3.1 are asympto- 
tically biased in general. Although they perform well in the 
simulation study based on the SIRD population, they have 
poor performance under the simulation setting in Section 
3.1, while the proposed linear regression imputation per- 
forms well. 

The results in Section 2 can be extended to the situation 
where each sample unit has an observed covariate $x_t$ at
time $t$ without missing values. Assumption (1) may be
modified to include covariates: 


RON, My Ory 5 Opcyy Oy g4a es OF 
ee ee nt is Veen \ On ee Olly be hm Deals 
where X = (X,, ..., X,). Missing components of y, can be 


imputed using one of the procedures in Sections 2.1 -2.3 
with (y,,.... y,) replaced by (),,,..., ¥;,, X;). After all 
missing values are imputed, we can also estimate the 
relationship between y and X using some popular ap- 
proaches such as the generalized estimation equation ap- 
proach. Some details can be found in Xu (2007). 

It is implicitly assumed throughout the paper that the y- 
values are continuous variables with no restriction. When y- 
values have a particular order or are integer valued, the 
proposed regression imputation methods are clearly not 
suitable. New methods for these situations have to be 
developed.

Appendix


Proof of (2) - (3). Let L(&) denote the distribution of € 
and L(& | ¢) denote the conditional distribution of € given 
€y Uetty ayes ye). and Ome (6,526)) bThenioboth 
(2) and (3) follow from L(y, | y,»8,) = LO;1¥,-1 
oa) ‘- L(y, 5,1) BLY ray, 5,1) ms [L(6,_, ve On5) 3 
L(6,_, LY 5, ILO, Lae 075) Ti L(y, Yip 5,5) = 
LAV NW, 0.2) = oy V7, ). where. the: first and 
third equalities follow from assumption (1). 

Proof of (5). Using the same notation as in the proof of 
(2) andy letting. »A, =/1\, be: the indicator of, 6; =2::= 
o=",cwe shave L(y, y,, A, = 176-,, = 0,5, = 0) = 
LE Ge a0 reye Seto 0) 10.) 0 7A. = 
1,6, = 0) 2 Galy_ Ae = 1,.0,= 0), which. is equal to 
L(y, | y,-A, = 1,6, = 0) by (1). Similarly, we can show 
that L(y, be A, oi On salle 5, <j 0) a L(y, ye A, ; 
l,6,=0), Henee,, LO \y., A, =1,6,., =0,6, = 9) 
L(y, |y,.A, = 1,6,,, = 1,56, = 0) and result (5) follows. 

An example in which (4) does not hold. To show that (4) 
does not hold in general, we only need to give a counterexample. Consider $T = 3$. Let $(y_1, y_2, y_3)$ be jointly normal with $E(y_t) = 0$, $\mathrm{var}(y_t) = 1$, $t = 1, 2, 3$, $\mathrm{cov}(y_1, y_2) = \mathrm{cov}(y_2, y_3) = \rho$, and $\mathrm{cov}(y_1, y_3) = \rho^2$, where $\rho \neq 0$ is a parameter. Suppose that $y_1$ is always observed and $P(\delta_t = 0 \mid y_{t-1}) = \Phi(a_t + b_t y_{t-1})$, $t = 2, 3$, where $a_t$ and $b_t$ are parameters and $\Phi$ is the cumulative distribution function of the standard normal distribution. Then, $E(y_3 \mid y_2, y_1) = \rho y_2$, $E(y_2 \mid y_1) = \rho y_1$, and $E(y_3 \mid y_1) = \rho^2 y_1$.
Note that 


Esp Og 0,05 = 6) = 1) = £03 13,03 = 
= E(y3| 93 = 9) 


= [y,L04 19.5; = 0) dy, 


Gros= 11) 


a Jr JLO, | Vis V2, 93 = OL (V2 | V1, 83 = 0) dy. dy; 
= [[ L031 %)LOr| 283 = dy, dy, 


= ([£0sl¥2) 4y3)LOr ly 8s = Ody, 


p| yL(721 4,85 = 0) dy, 


‘a p| PS, = 0/ ¥,) L021) dy. 
[PO = Oly )LOr1¥) dy, 


p| v.® (a, + byy,) L(y |) yz 
[e@ ks b,y,) L(y, |») dy, 
where the first equality holds because $y_1$ is always observed, the second equality holds because under (1), $\delta_3$ and $y_1$ are independent given $y_2$. The denominator of the
previous expression is equal to


a, + b,pPy, 


h(y,) = ® 
aD tee op 


Using integration by parts, we obtain that 


g(%1) = [2 — pPV)O (a +b: y)L21) dz 


lI 


b,(1 - p*)[ ®'(a, + by.) L211) 492 


= by (l= p* rl (a4 bpys) MOF ws dy 
2n,/1- 2 ; ; 
s De) 


2(1-p°) 
Tl +b, (rp 


2[1+b;(1- p*)] 


Thus, 
PU yo; = 0,0, 50, 21) = py EP g(vi) _ (10) 
h(y,) 
However, 


5 de 1) in Oaaar el 
= E(y3|)) = 2 Y- 


E(y;3 ly> 6 


This shows that (4) does not hold in this special case. 
Proof of (8). Using the notation in the proof of (2)-(3) 


and writing the (¢ —2)-dimensional vector (),..., V,1 
Vrs V1) aS u,,, we obtain that 
LO6,41 a l | Vp Zs A, 150; ~ 0) 

< PEC, = lly, Z 19 Uy» A, es Sie x 0) 


PUR ye A el oie 0) 
= LO, ied Ve ce) Na) 

UG vee ae O_O) 
= biped 


ORS 


ag, 1 Oy = Oa 


= TPS 


r+] 
[£@, yl v.24, =1,8,40an,, 


= PAG wees lee te Kee Le 


r+] 
where the second equality follows from assumption (1) and
the fact that there is a one-to-one function between 
(z,,u,,) and (),,...,¥,,), and the third equality follows 
from assumption (7). Similarly, L(6,.,,=1|z,,A,= 1,6,= 


0) = L(6,,,=1|z,,A, =1) and) hence Gy mye 
ziyhe 18,0) DOhy =f zean= 7 oes) aiken, 
Eye ee oO) 


= L(y, Zi Bon eS iN 5, < 0) 
a 1s Se) 


_ LG =N Ip 254,=18,=0) LO, 2 4,= 1620) 
E(Se. oh) zy ASN 5:0) shtemihgasjeaO) 
=1L(,|z,,A, = re =u) 


Simlariye 1(y| 2, 1.0 
[> 0, =U ag LICtiCes 
L(y, | Zi5 age 


oa SO. — Uy ay, leet (ae 
L(y, rg =1,8,,,= 0,6, =0)= 
= ()) and result (8) follows. 
Some theory for propensity-score-adjustment
estimators in survey sampling 


Jae Kwang Kim and Minsun Kim Riddles ' 


Abstract 


The propensity-scoring-adjustment approach is commonly used to handle selection bias in survey sampling applications, 
including unit nonresponse and undercoverage. The propensity score is computed using auxiliary variables observed 
throughout the sample. We discuss some asymptotic properties of propensity-score-adjusted estimators and derive optimal 
estimators based on a regression model for the finite population. An optimal propensity-score-adjusted estimator can be 
implemented using an augmented propensity model. Variance estimation is discussed and the results from two simulation studies are presented.


Key Words: Calibration; Missing data; Nonresponse; Weighting. 


1. Introduction 


Consider a finite population of size $N$, where $N$ is known. For each unit $i$, $y_i$ is the study variable and $x_i$ is the $q$-dimensional vector of auxiliary variables. The parameter of interest is the finite population mean of the study variable, $\theta = N^{-1}\sum_{i=1}^{N} y_i$. The finite population $\mathcal{F}_N = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is assumed to be a random sample of size $N$ from a superpopulation distribution $F(x, y)$. Suppose a sample of size $n$ is drawn from the finite population according to a probability sampling design. Let $w_i = \pi_i^{-1}$ be the design weight, where $\pi_i$ is the first-order inclusion probability of unit $i$ obtained from the probability sampling design. Under complete response, the finite population mean can be estimated by the Horvitz-Thompson (HT) estimator, $\hat\theta_{HT} = N^{-1}\sum_{i \in A} w_i y_i$, where $A$ is the set of indices appearing in the sample.

In the presence of missing data, the HT estimator $\hat\theta_{HT}$ cannot be computed. Let $r$ be the response indicator variable that takes the value one if $y$ is observed and takes the value zero otherwise. Conceptually, as discussed by Fay (1992), Shao and Steel (1999), and Kim and Rao (2009), the response indicator can be extended to the entire population as $\mathcal{R}_N = \{r_1, r_2, \ldots, r_N\}$, where $r_i$ is a realization of the random variable $r$. In this case, the complete-case (CC) estimator $\hat\theta_{CC} = \sum_{i \in A} w_i r_i y_i / \sum_{i \in A} w_i r_i$ converges in probability to $E(Y \mid r = 1)$. Unless the response mechanism is missing completely at random in the sense that $E(Y \mid r = 1) = E(Y)$, the CC estimator is biased. To correct for the bias of the CC estimator, if the response probability

$$p(x, y) = \Pr(r = 1 \mid x, y) \tag{1}$$

is known, then the weighted CC estimator $\hat\theta_{WCC} = N^{-1}\sum_{i \in A} w_i r_i y_i / p(x_i, y_i)$ can be used to estimate $\theta$. Note that $\hat\theta_{WCC}$ is unbiased because $E\{\sum_{i \in A} w_i r_i y_i / p(x_i, y_i) \mid \mathcal{F}_N\} = E\{\sum_{i=1}^{N} r_i y_i / p(x_i, y_i) \mid \mathcal{F}_N\} = \sum_{i=1}^{N} y_i$.


If the response probability (1) is unknown, one can postulate a parametric model for the response probability $p(x, y; \phi)$ indexed by $\phi \in \Omega$ such that $p(x, y) = p(x, y; \phi_0)$ for some $\phi_0 \in \Omega$. We assume that there exists a $\sqrt{n}$-consistent estimator $\hat\phi$ of $\phi_0$ such that

$$\sqrt{n}\,(\hat\phi - \phi_0) = O_p(1), \tag{2}$$

where $g_n = O_p(1)$ indicates $g_n$ is bounded in probability. Using $\hat\phi$, we can obtain the estimated response probability by $\hat p_i = p(x_i, y_i; \hat\phi)$, which is often called the propensity score (Rosenbaum and Rubin 1983). The propensity-score-adjusted (PSA) estimator can be constructed as

$$\hat\theta_{PSA} = \frac{1}{N}\sum_{i \in A} w_i \frac{r_i}{\hat p_i}\, y_i. \tag{3}$$
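As a rough illustration of how the HT, CC, and PSA estimators defined above differ computationally, the following Python sketch evaluates them on a toy sample. The function names, the toy data, and the propensity values are assumptions made for illustration only; in practice the propensities would come from a fitted response model.

```python
import numpy as np

def ht_mean(y, w, N):
    """Horvitz-Thompson mean: N^{-1} * sum_i w_i y_i (needs complete response)."""
    return np.sum(w * y) / N

def cc_mean(y, w, r):
    """Complete-case mean: weighted mean of y over respondents only."""
    return np.sum(w * r * y) / np.sum(w * r)

def psa_mean(y, w, r, p_hat, N):
    """PSA mean as in (3): N^{-1} * sum_i w_i r_i y_i / p_hat_i."""
    return np.sum(w * r * y / p_hat) / N

# Toy sample of n = 4 units from a population of N = 8 (equal weights N/n = 2).
N = 8
w = np.full(4, 2.0)
y = np.array([1.0, 2.0, 3.0, 4.0])      # the value for the nonrespondent is a placeholder
r = np.array([1.0, 1.0, 0.0, 1.0])      # the third unit did not respond
p_hat = np.array([0.5, 0.5, 0.5, 1.0])  # hypothetical estimated propensities

theta_cc = cc_mean(y, w, r)
theta_psa = psa_mean(y, w, r, p_hat, N)
```

Because every term is multiplied by the response indicator, the placeholder value stored for a nonrespondent never enters the CC or PSA estimates, provided it is a finite number.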


The PSA estimator (3) is widely used. Many surveys use 
the PSA estimator to reduce nonresponse bias (Fuller, 
Loughin and Baker 1994; Rizzo, Kalton and Brick 1996). 
Rosenbaum and Rubin (1983) and Rosenbaum (1987) 
proposed using the PSA approach to estimate the treatment 
effects in observational studies. Little (1988) reviewed the 
PSA methods for handling unit nonresponse in survey 
sampling. Duncan and Stasny (2001) used the PSA ap- 
proach to control coverage bias in telephone surveys. 
Folsom (1991) and Iannacchione, Milne and Folsom (1991) 
used a logistic regression model for the response probability 
estimation. Lee (2006) applied the PSA method to a 
volunteer panel web survey. Durrant and Skinner (2006) 
used the PSA approach to address measurement error. 

Despite the popularity of PSA estimators, asymptotic 
properties of PSA estimators have not received much 
attention in survey sampling literature. Kim and Kim (2007) 
used a Taylor expansion to obtain the asymptotic mean and 
variance of PSA estimators and discussed variance esti- 
mation. Da Silva and Opsomer (2006) and Da Silva and
Opsomer (2009) considered nonparametric methods to
obtain PSA estimators. 

In this paper, we discuss optimal PSA estimators in the class of PSA estimators of the form (3) that use a $\sqrt{n}$-consistent estimator $\hat\phi$. Such estimators are asymptotically unbiased for $\theta$. Finding minimum variance PSA estimators among this particular class of PSA estimators is a topic of major interest in this paper.

Section 2 presents the main results. An optimal PSA 
estimator using an augmented propensity score model is 
proposed in Section 3. In Section 4, variance estimation of 
the proposed estimator is discussed. Results from two 
simulation studies can be found in Section 5 and concluding 
remarks are made in Section 6. 


2. Main results 


In this section, we discuss some asymptotic properties of 
PSA estimators. We assume that the response mechanism 
does not depend on y. Thus, we assume that 


$$\Pr(r = 1 \mid x, y) = \Pr(r = 1 \mid x) = p(x; \phi_0) \tag{4}$$


for some unknown vector $\phi_0$. The first equality implies that the data are missing-at-random (MAR), as we always observe $x$ in the sample. Note that the MAR condition is assumed in the population model. In the second equality, we further assume that the response mechanism is known up to an unknown parameter $\phi_0$. The response mechanism is slightly different from that of Kim and Kim (2007), where the response mechanism is assumed to be under the classical two-phase sampling setup and depends on the realized sample:

$$\Pr(r = 1 \mid x, y, I = 1) = \Pr(r = 1 \mid x, I = 1) = p(x; \phi_0^*). \tag{5}$$

Here, $I$ is the sampling indicator function defined throughout the population. That is, $I_i = 1$ if $i \in A$ and $I_i = 0$ otherwise. Unless the sampling design is non-informative, in the sense that the sample selection probabilities are uncorrelated with the response indicator after conditioning on the auxiliary variables (Pfeffermann, Krieger and Rinott 1998), the two response mechanisms, (4) and (5), are different. In survey sampling, assumption (4) is more appropriate because an individual's decision on whether or not to respond to a survey is at his or her own discretion. Here, the response indicator variable $r_i$ is defined throughout the population, as discussed in Section 1.

We consider a class of $\sqrt{n}$-consistent estimators of $\phi_0$ in (4). In particular, we consider a class of estimators which can be written as a solution to

$$\hat U_h(\phi) \equiv \sum_{i \in A} w_i \{r_i - p_i(\phi)\}\, h_i(\phi) = 0, \tag{6}$$

where $p_i(\phi) = p(x_i; \phi)$ for some function $h_i(\phi) = h(x_i; \phi)$, a smooth function of $x_i$ and parameter $\phi$. Thus, the solution to (6) can be written as $\hat\phi_h$, which depends on the choice of $h_i(\phi)$. Any solution $\hat\phi_h$ to (6) is consistent for $\phi_0$ in (4) because $E\{\hat U_h(\phi_0) \mid \mathcal{F}_N\} = E[\sum_{i=1}^{N} \{r_i - p_i(\phi_0)\} h_i(\phi_0) \mid \mathcal{F}_N]$ is zero under the response mechanism in (4). If we drop the sampling weights $w_i$ in (6), the estimated parameter $\hat\phi_h$ is consistent for $\phi_0^*$ in (5) and the resulting PSA estimator is consistent only when the sampling design is non-informative. The PSA estimators obtained from (6) using the sampling weights are consistent regardless of whether the sampling design is non-informative or not. According to Chamberlain (1987), any $\sqrt{n}$-consistent estimator of $\phi_0$ in (4) can be written as a solution to (6). Thus, the choice of $h_i(\phi)$ in (6) determines the efficiency of the resulting PSA estimator.
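To make the estimating equation (6) concrete, here is a minimal Python sketch that solves it by Newton iterations for one particular case: a logistic response model with $h_i(\phi) = (1, x_i)'$, in which case (6) is the design-weighted logistic score equation. The function name `solve_weighted_score`, the simulated data, and the parameter values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def solve_weighted_score(H, r, w, n_iter=50, tol=1e-8):
    """Newton iterations for sum_i w_i {r_i - p_i(phi)} H_i = 0 with logistic p_i."""
    phi = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ phi))
        U = H.T @ (w * (r - p))                       # estimating function in (6)
        if np.max(np.abs(U)) < tol * np.sum(w):
            break
        J = (H * (w * p * (1.0 - p))[:, None]).T @ H  # minus the Jacobian of U
        phi = phi + np.linalg.solve(J, U)
    return phi

# Hypothetical simple random sample of n = 400 from N = 10,000.
rng = np.random.default_rng(0)
n, N = 400, 10_000
x = rng.normal(1.0, 1.0, n)
H = np.column_stack([np.ones(n), x])
w = np.full(n, N / n)
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))
r = (rng.uniform(size=n) < p_true).astype(float)

phi_hat = solve_weighted_score(H, r, w)
p_hat = 1.0 / (1.0 + np.exp(-H @ phi_hat))
U_at_solution = H.T @ (w * (r - p_hat))
```

Other choices of $h_i(\phi)$ change only the function being set to zero, so the same Newton structure applies with a different Jacobian.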

Let $\hat\theta_{PSA,h}$ be the PSA estimator in (3) using $\hat p_i = p_i(\hat\phi_h)$ with $\hat\phi_h$ being the solution to (6). To discuss the asymptotic properties of $\hat\theta_{PSA,h}$, assume a sequence of finite populations and samples, as in Isaki and Fuller (1982), such that $\sum_{i \in A} w_i u_i - \sum_{i=1}^{N} u_i = O_p(n^{-1/2} N)$ for any population characteristics $u_i$ with bounded fourth moments. We also assume that the sampling weights are uniformly bounded. That is, $K_1 < N^{-1} n w_i < K_2$ for all $i$ uniformly in $n$, where $K_1$ and $K_2$ are fixed constants. In addition, we assume the following regularity conditions:

[C1] The response mechanism satisfies (4), where $p(x; \phi)$ is continuous in $\phi$ with continuous first and second derivatives in an open set containing $\phi_0$. The responses are independent in the sense that $\mathrm{Cov}(r_i, r_j \mid x) = 0$ for $i \neq j$. Also, $p(x_i; \phi_0) > c$ for all $i$ for some fixed constant $c > 0$.

[C2] The solution to (6) exists and is unique almost everywhere. The function $h_i(\phi) = h(x_i; \phi)$ in (6) has a bounded fourth moment. Furthermore, the partial derivative $\partial\{\hat U_h(\phi)\}/\partial\phi$ is nonsingular for all $n$.

[C3] The estimating function $\hat U_h(\phi)$ in (6) converges in probability to $U_h(\phi) = \sum_{i=1}^{N} \{r_i - p_i(\phi)\} h_i(\phi)$ uniformly in $\phi$. Furthermore, the partial derivative $\partial\{\hat U_h(\phi)\}/\partial\phi$ converges in probability to $\partial\{U_h(\phi)\}/\partial\phi$ uniformly in $\phi$. The solution $\phi_N$ to $U_h(\phi) = 0$ satisfies $N^{1/2}(\phi_N - \phi_0) = O_p(1)$ under the response mechanism.

Condition [C1] states the regularity conditions for the response mechanism. Condition [C2] is the regularity condition for the solution $\hat\phi_h$ to (6). In Condition [C3], some regularity conditions are imposed on the estimating function $\hat U_h(\phi)$ itself. By [C2] and [C3], we can establish the $\sqrt{n}$-consistency (2) of $\hat\phi_h$.

Now, the following theorem deals with some asymptotic properties of the PSA estimator $\hat\theta_{PSA,h}$.


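Theorem 1 If conditions [C1]-[C3] hold, then under the joint distribution of the sampling mechanism and the response mechanism, the PSA estimator $\hat\theta_{PSA,h}$ satisfies

$$\sqrt{n}\,(\hat\theta_{PSA,h} - \tilde\theta_{PSA,h}) = o_p(1), \tag{7}$$

where

$$\tilde\theta_{PSA,h} = \frac{1}{N}\sum_{i \in A} w_i \left\{ p_i h_i' \gamma_h^* + \frac{r_i}{p_i}\,(y_i - p_i h_i' \gamma_h^*) \right\}, \tag{8}$$

$\gamma_h^* = (\sum_{i=1}^{N} r_i z_i p_i h_i')^{-1}(\sum_{i=1}^{N} r_i z_i y_i)$, $p_i = p(x_i; \phi_0)$, $z_i = \partial\{p^{-1}(x_i; \phi_0)\}/\partial\phi$, and $h_i = h(x_i; \phi_0)$. Moreover, if the finite population is a random sample from a superpopulation model, then

$$V(\tilde\theta_{PSA,h}) \geq V_l \equiv V(\hat\theta_{HT}) + \frac{1}{N^2} E\left[\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) V(Y \mid x_i)\right]. \tag{9}$$

The equality in (9) holds when $\hat\phi_h$ satisfies

$$\sum_{i \in A} w_i \frac{r_i}{p_i(\hat\phi_h)}\, E(Y \mid x_i) = \sum_{i \in A} w_i\, E(Y \mid x_i), \tag{10}$$

where $E(Y \mid x_i)$ is the conditional expectation under the superpopulation model.

Proof. Given $p_i(\phi) = p(x_i; \phi)$ and $h_i(\phi) = h(x_i; \phi)$, define

$$\hat\theta(\phi, \gamma) = \frac{1}{N}\sum_{i \in A} w_i \left[ p_i(\phi) h_i'(\phi)\gamma + \frac{r_i}{p_i(\phi)} \{y_i - p_i(\phi) h_i'(\phi)\gamma\} \right].$$

Since $\hat\phi_h$ satisfies (6), we have $\hat\theta_{PSA,h} = \hat\theta(\hat\phi_h, \gamma)$ for any choice of $\gamma$. We now want to find a particular choice of $\gamma$, say $\gamma^*$, such that

$$\hat\theta(\hat\phi_h, \gamma^*) = \hat\theta(\phi_0, \gamma^*) + o_p(n^{-1/2}). \tag{11}$$

As $\hat\phi_h$ converges in probability to $\phi_0$, the asymptotic equivalence (11) holds if

$$E\left\{ \frac{\partial}{\partial\phi}\,\hat\theta(\phi, \gamma^*) \,\Big|\, \phi = \phi_0 \right\} = 0, \tag{12}$$

using the theory of Randles (1982). Condition (12) holds if $\gamma^* = \gamma_h^*$, where $\gamma_h^*$ is defined in (8). Thus, (11) reduces to

$$\hat\theta_{PSA,h} = \frac{1}{N}\sum_{i \in A} w_i \left\{ p_i h_i' \gamma_h^* + \frac{r_i}{p_i}\,(y_i - p_i h_i' \gamma_h^*) \right\} + o_p(n^{-1/2}), \tag{13}$$

which proves (7). The variance of $\tilde\theta_{PSA,h}$ can be derived as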
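$$
\begin{aligned}
V(\tilde\theta_{PSA,h})
&= V(\hat\theta_{HT}) + \frac{1}{N^2} E\left[\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) (y_i - p_i h_i' \gamma_h^*)^2\right] \\
&= V(\hat\theta_{HT}) + \frac{1}{N^2} E\left[\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) \{y_i - E(Y \mid x_i) + E(Y \mid x_i) - p_i h_i' \gamma_h^*\}^2\right] \\
&= V(\hat\theta_{HT}) + \frac{1}{N^2} E\left[\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) V(Y \mid x_i)\right] \\
&\quad + \frac{1}{N^2} E\left[\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) \{E(Y \mid x_i) - p_i h_i' \gamma_h^*\}^2\right],
\end{aligned}
\tag{14}
$$

where the last equality follows because $y_i$ is conditionally independent of $E(Y \mid x_i) - p_i h_i' \gamma_h^*$, conditioning on $x_i$.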
Since the last term in (14) is non-negative, the inequality in (9) is established. Furthermore, if $E(Y \mid x_i) = p_i h_i' \alpha$ for some $\alpha$, then (10) holds and $E(\gamma_h^* \mid x_1, \ldots, x_N) = \alpha$ by the definition of $\gamma_h^*$. Thus, $E(Y \mid x_i) - p_i h_i' \gamma_h^* = -p_i h_i' \{\gamma_h^* - E(\gamma_h^* \mid x_1, \ldots, x_N)\} = o_p(1)$, implying that the last term in (14) is negligible.


In (9), $V_l$ is the lower bound of the asymptotic variance of PSA estimators of the form (3) satisfying (6). Any PSA estimator that has the asymptotic variance $V_l$ in (9) is optimal in the sense that it achieves the lower bound of the asymptotic variance among the class of PSA estimators with $\hat\phi$ satisfying (2). The asymptotic variance of optimal PSA estimators of $\theta$ is equal to $V_l$ in (9). The PSA estimator using the maximum likelihood estimator of $\phi_0$ does not necessarily achieve the lower bound of the asymptotic variance.

Condition (10) provides a way of constructing an optimal PSA estimator. First, we need an assumption for $E(Y \mid x)$, which is often called the outcome regression model. If the outcome regression model is a linear regression model of the form $E(Y \mid x) = \beta_0 + \beta_1' x$, an optimal PSA estimator of $\theta$ can be obtained by solving

$$\sum_{i \in A} w_i \frac{r_i}{p_i(\phi)}\,(1, x_i) = \sum_{i \in A} w_i\,(1, x_i). \tag{15}$$

Condition (15) is appealing because it says that the PSA estimator applied to $y = a + b'x$ leads to the original HT estimator. Condition (15) is called the calibration condition in survey sampling. The calibration condition applied to $x$ makes full use of the information contained in it if the study variable is well approximated by a linear function of $x$. Condition (15) was also used in Nevo (2003) and Kott (2006) under the linear regression model.
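The calibration condition (15) can be explored numerically. The sketch below assumes a logistic form for $p_i(\phi)$ and solves (15) by Newton iterations with step halving; the function names, the simulated data, and the step-halving safeguard are illustrative assumptions rather than the authors' implementation. It then checks the property stated above: for a variable that is exactly linear in $x$, the calibrated PSA estimate coincides with the HT estimate.

```python
import numpy as np

def calib_gap(phi, H, r, w):
    """Left side minus right side of (15) for a logistic p_i(phi)."""
    inv_p = 1.0 + np.exp(-H @ phi)            # 1 / p_i(phi)
    return H.T @ (w * r * inv_p) - H.T @ w

def calibrate_propensity(H, r, w, n_iter=100, tol=1e-10):
    phi = np.zeros(H.shape[1])
    for _ in range(n_iter):
        g = calib_gap(phi, H, r, w)
        if np.max(np.abs(g)) < tol * np.sum(w):
            break
        odds_inv = np.exp(-H @ phi)           # (1 - p_i) / p_i
        J = (H * (w * r * odds_inv)[:, None]).T @ H   # minus the Jacobian of the gap
        step = np.linalg.solve(J, g)
        t = 1.0                               # halve the step until the gap shrinks
        while t > 1e-6 and not (
            np.linalg.norm(calib_gap(phi + t * step, H, r, w)) < np.linalg.norm(g)
        ):
            t /= 2.0
        phi = phi + t * step
    return phi

# Hypothetical simple random sample of n = 400 from N = 10,000.
rng = np.random.default_rng(1)
n, N = 400, 10_000
x = rng.normal(1.0, 1.0, n)
H = np.column_stack([np.ones(n), x])
w = np.full(n, N / n)
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))
r = (rng.uniform(size=n) < p_true).astype(float)

phi_cal = calibrate_propensity(H, r, w)
p_cal = 1.0 / (1.0 + np.exp(-H @ phi_cal))

y_lin = 2.0 + 3.0 * x                          # exactly linear in x
theta_psa = np.sum(w * r * y_lin / p_cal) / N
theta_ht = np.sum(w * y_lin) / N
```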
If we explicitly use a regression model for $E(Y \mid x)$, it is possible to construct an estimator that has asymptotic variance (9) and is not necessarily a PSA estimator. For example, if we assume that

$$E(Y \mid x) = m(x; \beta_0) \tag{16}$$

for some function $m(x; \cdot)$ known up to $\beta_0$, we can use the model (16) directly to construct an optimal estimator of the form

$$\hat\theta_{opt} = \frac{1}{N}\sum_{i \in A} w_i \left[ m(x_i; \hat\beta) + \frac{r_i}{p_i(\hat\phi)} \{y_i - m(x_i; \hat\beta)\} \right], \tag{17}$$

where $\hat\beta$ is a $\sqrt{n}$-consistent estimator of $\beta_0$ in the superpopulation model (16) and $\hat\phi$ is a $\sqrt{n}$-consistent estimator of $\phi_0$ computed by (6). The following theorem shows that the optimal estimator (17) achieves the lower bound in (9).
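A minimal sketch of the estimator in (17) follows, under assumptions that are not fixed by the text at this point: a logistic response model fitted by the weighted score equation, a linear outcome model $m(x; \beta)$ fitted by unweighted ordinary least squares among respondents, and simulated data. The names `fit_logistic` and `opt_mean` are hypothetical.

```python
import numpy as np

def fit_logistic(H, r, w, n_iter=50):
    """Weighted logistic fit: solves sum_i w_i (r_i - p_i) H_i = 0 by Newton steps."""
    phi = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ phi))
        J = (H * (w * p * (1.0 - p))[:, None]).T @ H
        phi = phi + np.linalg.solve(J, H.T @ (w * (r - p)))
    return phi

def opt_mean(y, H, r, w, N):
    """Estimator of the form (17) with a linear working model m(x; beta) = H beta."""
    p_hat = 1.0 / (1.0 + np.exp(-H @ fit_logistic(H, r, w)))
    resp = r == 1
    beta_hat = np.linalg.lstsq(H[resp], y[resp], rcond=None)[0]  # OLS on respondents
    m_hat = H @ beta_hat
    resid = np.where(resp, y - m_hat, 0.0)     # residuals are used for respondents only
    return np.sum(w * (m_hat + resid / p_hat)) / N

# Hypothetical simple random sample of n = 400 from N = 10,000.
rng = np.random.default_rng(2)
n, N = 400, 10_000
x = rng.normal(1.0, 1.0, n)
H = np.column_stack([np.ones(n), x])
w = np.full(n, N / n)
r = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))).astype(float)

y = 1.0 + x + rng.normal(0.0, 1.0, n)
theta_opt = opt_mean(y, H, r, w, N)

# Sanity check: with a noiseless linear outcome, the prediction term carries everything.
y_lin = 1.0 + x
theta_opt_lin = opt_mean(y_lin, H, r, w, N)
theta_ht_lin = np.sum(w * y_lin) / N
```

The estimate is a prediction term plus a propensity-weighted residual correction, so it need not have the PSA form (3).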


Theorem 2 Let the conditions of Theorem 1 hold. Assume that $\hat\beta$ satisfies $\hat\beta = \beta_0 + O_p(n^{-1/2})$. Assume that, in the superpopulation model (16), $m(x; \beta)$ has continuous first-order partial derivatives in an open set containing $\beta_0$. Under the joint distribution of the sampling mechanism, the response mechanism, and the superpopulation model (16), the estimator $\hat\theta_{opt}$ in (17) satisfies

$$\sqrt{n}\,(\hat\theta_{opt} - \tilde\theta_{opt}) = o_p(1),$$

where

$$\tilde\theta_{opt} = \frac{1}{N}\sum_{i \in A} w_i \left[ m(x_i; \beta_0) + \frac{r_i}{p_i} \{y_i - m(x_i; \beta_0)\} \right],$$

$p_i = p_i(\phi_0)$, and $V(\tilde\theta_{opt})$ is equal to $V_l$ in (9).


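Proof. Define $\hat\theta_{opt}(\beta, \phi) = N^{-1}\sum_{i \in A} w_i [m(x_i; \beta) + r_i p_i^{-1}(\phi) \{y_i - m(x_i; \beta)\}]$. Note that $\hat\theta_{opt}$ in (17) can be written as $\hat\theta_{opt} = \hat\theta_{opt}(\hat\beta, \hat\phi)$. Since

$$\frac{\partial}{\partial\beta}\,\hat\theta_{opt}(\beta, \phi) = \frac{1}{N}\sum_{i \in A} w_i \left\{1 - \frac{r_i}{p_i(\phi)}\right\} \dot m(x_i; \beta),$$

where $\dot m(x_i; \beta) = \partial m(x_i; \beta)/\partial\beta$, and

$$\frac{\partial}{\partial\phi}\,\hat\theta_{opt}(\beta, \phi) = \frac{1}{N}\sum_{i \in A} w_i r_i z_i(\phi) \{y_i - m(x_i; \beta)\},$$

where $z_i(\phi) = \partial\{p_i^{-1}(\phi)\}/\partial\phi$, we have $E[\partial\{\hat\theta_{opt}(\beta, \phi)\}/\partial(\beta, \phi) \mid \beta = \beta_0, \phi = \phi_0] = 0$ and the condition of Randles (1982) is satisfied. Thus,

$$\hat\theta_{opt}(\hat\beta, \hat\phi) = \hat\theta_{opt}(\beta_0, \phi_0) + o_p(n^{-1/2}) = \tilde\theta_{opt} + o_p(n^{-1/2})$$

and the variance of $\tilde\theta_{opt}$ is equal to $V_l$, the lower bound of the asymptotic variance.

The (asymptotic) optimality of the estimator in (17) is justified under the joint distribution of the response model (4) and the superpopulation model (16). When both models are correct, $\hat\theta_{opt}$ is optimal and the choice of $(\hat\beta, \hat\phi)$ does not affect the efficiency of $\hat\theta_{opt}$ as long as $(\hat\beta, \hat\phi)$ is $\sqrt{n}$-consistent. Robins, Rotnitzky and Zhao (1994) also advocated using $\hat\theta_{opt}$ in (17) under simple random sampling.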


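Remark 1 When the response model is correct and the superpopulation model (16) is not necessarily correct, the choice of $\hat\beta$ does affect the efficiency of the optimal estimator. Cao, Tsiatis and Davidian (2009) considered optimal estimation when only the response model is correct. Using Taylor linearization, the optimal estimator in (17) with $\hat\phi$ satisfying (6) is asymptotically equivalent to

$$\tilde\theta(\beta) = \frac{1}{N}\sum_{i \in A} w_i \left[ m(x_i; \beta) + \frac{r_i}{p_i} \{y_i - m(x_i; \beta)\} - \left(\frac{r_i}{p_i} - 1\right) c_\beta' p_i h_i \right],$$

where $c_\beta$ is the probability limit of $\hat c_\beta = \{\sum_{i \in A} w_i r_i z_i(\hat\phi)\, \hat p_i h_i'(\hat\phi)\}^{-1} \sum_{i \in A} w_i r_i z_i(\hat\phi) \{y_i - m(x_i; \beta)\}$ and $z_i(\phi) = \partial\{p_i^{-1}(\phi)\}/\partial\phi$. The asymptotic variance is then equal to

$$V\{\tilde\theta(\beta)\} = V(\hat\theta_{HT}) + E\left[ \frac{1}{N^2}\sum_{i \in A} w_i^2 \left(\frac{1}{p_i} - 1\right) \{y_i - m(x_i; \beta) - c_\beta' p_i h_i\}^2 \right].$$

Thus, an optimal estimator of $\beta$ can be computed by finding $\hat\beta$ that minimizes

$$Q(\beta) = \sum_{i \in A} w_i^2 r_i\, \frac{1 - \hat p_i}{\hat p_i^2}\, \{y_i - m(x_i; \beta) - \hat c_\beta' \hat p_i h_i(\hat\phi)\}^2.$$

The resulting estimator is design-optimal in the sense that it minimizes the asymptotic variance under the response model.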


3. Augmented propensity score model 


In this section, we consider optimal PSA estimation. Note that the optimal estimator $\hat\theta_{opt}$ in (17) is not necessarily written as a PSA estimator form in (3). It is in the PSA estimator form if it satisfies $\sum_{i \in A} w_i r_i \hat p_i^{-1} m(x_i; \hat\beta) = \sum_{i \in A} w_i m(x_i; \hat\beta)$. Thus, we can construct an optimal PSA estimator by including $m(x_i; \hat\beta)$ in the model for the propensity score. Specifically, given $\hat m_i = m(x_i; \hat\beta)$, $\hat p_i = p_i(\hat\phi)$ and $\hat h_i = h_i(\hat\phi)$, where $\hat\phi$ is obtained from (6), we augment the response model by

$$p_i^*(\hat\phi, \lambda) \equiv \frac{\hat p_i}{\hat p_i + (1 - \hat p_i)\exp(\lambda_0 + \lambda_1 \hat m_i)}, \tag{18}$$

where $\lambda = (\lambda_0, \lambda_1)'$ is the Lagrange multiplier which is used to incorporate the additional constraint. If $(\lambda_0, \lambda_1)' = 0$, then $p_i^*(\hat\phi, \lambda) = \hat p_i$. The augmented response probability $p_i^*(\hat\phi, \lambda)$ always takes values between 0 and 1. The augmented response probability model (18) can be derived by minimizing the Kullback-Leibler distance $\sum_{i \in A} w_i r_i q_i^* \log(q_i^*/q_i)$, where $q_i = (1 - \hat p_i)/\hat p_i$ and $q_i^* = (1 - p_i^*)/p_i^*$, subject to the constraint $\sum_{i \in A} w_i (r_i/p_i^*)(1, \hat m_i) = \sum_{i \in A} w_i (1, \hat m_i)$.

Using (18), the optimal PSA estimator is computed by

$$\hat\theta_{PSA}^* = \frac{1}{N}\sum_{i \in A} w_i \frac{r_i}{p_i^*(\hat\phi, \hat\lambda)}\, y_i, \tag{19}$$

where $\hat\lambda$ satisfies

$$\sum_{i \in A} w_i \frac{r_i}{p_i^*(\hat\phi, \hat\lambda)}\,(1, \hat m_i) = \sum_{i \in A} w_i\,(1, \hat m_i). \tag{20}$$
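The augmentation step in (18)-(20) can be sketched as follows, assuming the first-stage propensities and the outcome predictions are already available. Here the propensities are taken to be the true response probabilities and the predictions are a fixed linear function of $x$; both are simplifying assumptions, as are the function names and the step-halving Newton solver.

```python
import numpy as np

def augmented_propensity(p_hat, m_hat, lam):
    """p*_i as in (18): p / {p + (1 - p) exp(lam0 + lam1 * m_i)}."""
    return p_hat / (p_hat + (1.0 - p_hat) * np.exp(lam[0] + lam[1] * m_hat))

def solve_lambda(p_hat, m_hat, r, w, n_iter=100, tol=1e-10):
    """Find lambda satisfying the calibration constraint (20)."""
    M = np.column_stack([np.ones_like(m_hat), m_hat])
    target = M.T @ w

    def gap(lam):
        return M.T @ (w * r / augmented_propensity(p_hat, m_hat, lam)) - target

    lam = np.zeros(2)
    for _ in range(n_iter):
        g = gap(lam)
        if np.max(np.abs(g)) < tol * np.sum(w):
            break
        q_star = (1.0 - p_hat) / p_hat * np.exp(M @ lam)   # (1 - p*) / p*
        J = (M * (w * r * q_star)[:, None]).T @ M          # Jacobian of the gap
        step = np.linalg.solve(J, g)
        t = 1.0                                            # halve until the gap shrinks
        while t > 1e-6 and not (np.linalg.norm(gap(lam - t * step)) < np.linalg.norm(g)):
            t /= 2.0
        lam = lam - t * step
    return lam

# Hypothetical simple random sample of n = 400 from N = 10,000.
rng = np.random.default_rng(3)
n, N = 400, 10_000
x = rng.normal(1.0, 1.0, n)
w = np.full(n, N / n)
p_hat = 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))   # assumed first-stage propensities
r = (rng.uniform(size=n) < p_hat).astype(float)
m_hat = 1.0 + x                                   # assumed outcome predictions

lam_hat = solve_lambda(p_hat, m_hat, r, w)
p_star = augmented_propensity(p_hat, m_hat, lam_hat)

y_lin = 2.0 - 0.5 * m_hat                         # any linear function of m_hat
theta_aug = np.sum(w * r * y_lin / p_star) / N
theta_ht = np.sum(w * y_lin) / N
```

Because the constraint (20) is imposed on $(1, \hat m_i)$, the augmented PSA estimate of any linear function of $\hat m_i$ matches its HT estimate, which is what the final two lines compute.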


Under the response model (4), it can be shown that 


aie a ae (21) 


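Furthermore, by the argument for Theorem 1, we can establish that

$$\hat\theta_{PSA}^* = \frac{1}{N}\sum_{i \in A} w_i \left\{ b_0 + b_1 m_i + \gamma_{h2}' p_i h_i + \frac{r_i}{p_i}\,(y_i - b_0 - b_1 m_i - \gamma_{h2}' p_i h_i) \right\} + o_p(n^{-1/2}),$$

where $(b_0, b_1, \gamma_{h2})$ is the probability limit of $(\hat b_0, \hat b_1, \hat\gamma_{h2})$ with

$$\hat\gamma_{h2} = \left\{ \sum_{i \in A} w_i r_i z_i(\hat\phi)\, \hat p_i \hat h_i' \right\}^{-1} \sum_{i \in A} w_i r_i z_i(\hat\phi)\,(y_i - \hat b_0 - \hat b_1 \hat m_i) \tag{22}$$

and the effect of estimating $\phi_0$ in $\hat p_i = p(x_i; \hat\phi)$ can be safely ignored.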

Note that, under the response model (4), $(\hat\phi, \hat\lambda)$ in (19) converges in probability to $(\phi_0, 0)$, where $\phi_0$ is the true parameter in (4). Thus, the propensity score from the augmented model converges to the true response probability. Because $\hat\lambda$ converges to zero in probability, the choice of $\hat\beta$ in $\hat m_i = m(x_i; \hat\beta)$ does not play a role for the asymptotic unbiasedness of the PSA estimator. The asymptotic variances are changed for different choices of $\hat\beta$.

Under the superpopulation model (16), $\hat b_0 + \hat b_1 \hat m_i \to E(Y \mid x_i)$ in probability. Thus, the optimal PSA estimator in (19) is asymptotically equivalent to the optimal estimator in (17). Incorporating $\hat m_i$ into the calibration equation to achieve optimality is close in spirit to the model-calibration method proposed by Wu and Sitter (2001).


4. Variance estimation 


We now discuss variance estimation of PSA estimators 
under the assumed response model. Singh and Folsom 
(2000) and Kott (2006) discussed variance estimation for 
certain types of PSA estimators. Kim and Kim (2007) 
discussed variance estimation when the PSA estimator is 
computed with the maximum likelihood method. 

We consider variance estimation for the PSA estimator of the form (3) where $\hat p_i = p_i(\hat\phi)$ is constructed to satisfy (6) for some $h_i(\phi) = h(x_i; \phi, \hat\beta)$, where $\hat\beta$ is obtained using the postulated superpopulation model. Let $\beta^*$ be the probability limit of $\hat\beta$ under the response model. Note that $\beta^*$ is not necessarily equal to $\beta_0$ in (16) since we are not assuming that the postulated superpopulation model is correctly specified in this section.

Using the argument for the Taylor linearization (13) used in the proof of Theorem 1, the PSA estimator satisfies

$$\hat\theta_{PSA} = \frac{1}{N}\sum_{i \in A} w_i\, \eta_i(\phi_0, \beta^*) + o_p(n^{-1/2}), \tag{23}$$

where

$$\eta_i(\phi, \beta) = p_i(\phi) h_i'(\phi, \beta)\gamma_h^* + \frac{r_i}{p_i(\phi)} \{y_i - p_i(\phi) h_i'(\phi, \beta)\gamma_h^*\}, \tag{24}$$

$h_i(\phi, \beta) = h(x_i; \phi, \beta)$ and $\gamma_h^*$ is defined as in (8) with $h_i$ replaced by $h_i(\phi_0, \beta^*)$. Since $\hat p_i = p_i(\hat\phi)$ satisfies (6) with $h_i(\hat\phi) = h(x_i; \hat\phi, \hat\beta)$, $\hat\theta_{PSA} = N^{-1}\sum_{i \in A} w_i \eta_i(\hat\phi, \hat\beta)$ holds and the linearization in (23) can be expressed as $N^{-1}\sum_{i \in A} w_i \eta_i(\hat\phi, \hat\beta) = N^{-1}\sum_{i \in A} w_i \eta_i(\phi_0, \beta^*) + o_p(n^{-1/2})$. Thus, if $(x_i, y_i, r_i)$ are independent and identically distributed (IID), then $\eta_i(\phi_0, \beta^*)$ are IID even though $\eta_i(\hat\phi, \hat\beta)$ are not necessarily IID. Because $\eta_i(\phi_0, \beta^*)$ are IID, we can apply the standard complete sample method to estimate the variance of $\hat\eta_{HT} = N^{-1}\sum_{i \in A} w_i \eta_i(\phi_0, \beta^*)$, which is asymptotically equivalent to the variance of $\hat\theta_{PSA} = N^{-1}\sum_{i \in A} w_i \eta_i(\hat\phi, \hat\beta)$. See Kim and Rao (2009).

To derive the variance estimator, we assume that the variance estimator $\hat V = N^{-2}\sum_{i \in A}\sum_{j \in A} \Omega_{ij}\, g_i g_j$ satisfies $\hat V / V(\hat g_{HT} \mid \mathcal{F}_N) = 1 + o_p(1)$ for some $\Omega_{ij}$ related to the joint inclusion probability, where $\hat g_{HT} = N^{-1}\sum_{i \in A} w_i g_i$ for any $g$ with a finite second moment and $V(\hat g_{HT} \mid \mathcal{F}_N) = N^{-2}\sum_{i=1}^{N}\sum_{j=1}^{N} \Omega_{N,ij}\, g_i g_j$ for some $\Omega_{N,ij}$. We also assume that


Dual = ON). (25) 


To obtain the total variance, the reverse framework of 
Fay (1992), Shao and Steel (1999), and Kim and Rao (2009) 
is considered. In this framework, the finite population is first 
divided into two groups, a population of respondents and a 
population of nonrespondents. Given the population, the 
sample A is selected according to a probability sampling 
design. Thus, selection of the population respondents from 
the whole finite population is treated as the first-phase 
sampling and the selection of the sample respondents from 
the population respondents is treated as the second-phase 
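sampling in the reverse framework. The total variance of $\hat\eta_{HT}$ can be written as

$$V(\hat\eta_{HT} \mid \mathcal{F}_N) = V_1 + V_2 = E\{V(\hat\eta_{HT} \mid \mathcal{F}_N, \mathcal{R}_N) \mid \mathcal{F}_N\} + V\{E(\hat\eta_{HT} \mid \mathcal{F}_N, \mathcal{R}_N) \mid \mathcal{F}_N\}. \tag{26}$$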


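The conditional variance term $V(\hat\eta_{HT} \mid \mathcal{F}_N, \mathcal{R}_N)$ in (26) can be estimated by

$$\hat V_1 = \frac{1}{N^2}\sum_{i \in A}\sum_{j \in A} \Omega_{ij}\, \hat\eta_i \hat\eta_j, \tag{27}$$

where $\hat\eta_i = \eta_i(\hat\phi, \hat\beta)$ is defined in (24) with $\gamma_h^*$ replaced by a consistent estimator such as $\hat\gamma_h^* = \{\sum_{i \in A} w_i r_i z_i(\hat\phi)\, \hat p_i \hat h_i'\}^{-1} \sum_{i \in A} w_i r_i z_i(\hat\phi)\, y_i$ and $\hat h_i = h(x_i; \hat\phi, \hat\beta)$. To show that $\hat V_1$ is also consistent for $V_1$ in (26), it suffices to show that $V\{n \cdot V(\hat\eta_{HT} \mid \mathcal{F}_N, \mathcal{R}_N) \mid \mathcal{F}_N\} = o(1)$, which follows by (25) and the existence of the fourth moment. See Kim, Navarro and Fuller (2006). The second term $V_2$ in (26) is

$$V\{E(\hat\eta_{HT} \mid \mathcal{F}_N, \mathcal{R}_N) \mid \mathcal{F}_N\} = V\left( \frac{1}{N}\sum_{i=1}^{N} \eta_i \,\Big|\, \mathcal{F}_N \right) = \frac{1}{N^2}\sum_{i=1}^{N} \frac{1 - p_i}{p_i}\,(y_i - p_i h_i^{*\prime} \gamma_h^*)^2,$$

where $h_i^* = h(x_i; \phi_0, \beta^*)$. A consistent estimator of $V_2$ can be derived as

$$\hat V_2 = \frac{1}{N^2}\sum_{i \in A} w_i r_i\, \frac{1 - \hat p_i}{\hat p_i^2}\,(y_i - \hat p_i \hat h_i' \hat\gamma_h^*)^2, \tag{28}$$

where $\hat\gamma_h^*$ is defined after (27). Therefore,

$$\hat V(\hat\theta_{PSA}) = \hat V_1 + \hat V_2 \tag{29}$$

is consistent for the variance of the PSA estimator defined in (3) with $\hat p_i = p_i(\hat\phi)$ satisfying (6), where $\hat V_1$ is in (27) and $\hat V_2$ is in (28).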

Note that the first term of the total variance is $V_1 = O_p(n^{-1})$, but the second term is $V_2 = O_p(N^{-1})$. Thus, when the sampling fraction $nN^{-1}$ is negligible, that is, $nN^{-1} = o(1)$, the second term $V_2$ can be ignored and $\hat V_1$ is a consistent estimator of the total variance. Otherwise, the second term $V_2$ should be taken into consideration, so that a consistent variance estimator can be constructed as in (29).
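The two-term variance estimator can be sketched for one special case. The code below assumes simple random sampling without replacement (so that the design variance estimator for a mean reduces to $(1 - n/N)s^2/n$), a logistic response model with $h_i = (1, x_i)'$ fitted by the weighted score equation, and simulated data; the function names are hypothetical. For the logistic model, $z_i = \partial p_i^{-1}/\partial\phi = -\{(1 - p_i)/p_i\}(1, x_i)'$, which is used to form the estimate of $\gamma_h^*$.

```python
import numpy as np

def fit_logistic(H, r, w, n_iter=50):
    """Weighted logistic fit: solves sum_i w_i (r_i - p_i) H_i = 0 by Newton steps."""
    phi = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ phi))
        J = (H * (w * p * (1.0 - p))[:, None]).T @ H
        phi = phi + np.linalg.solve(J, H.T @ (w * (r - p)))
    return phi

def psa_with_variance_srs(y, H, r, w, N):
    """PSA estimate, pseudo-values as in (24), and V1 + V2 under an SRS assumption."""
    n = len(y)
    p = 1.0 / (1.0 + np.exp(-H @ fit_logistic(H, r, w)))
    y0 = np.where(r == 1, y, 0.0)                       # nonrespondent y is never used
    # With z_i = -(1 - p_i)/p_i * H_i and h_i = H_i, the minus signs cancel in gamma.
    A = (H * (w * r * (1.0 - p))[:, None]).T @ H        # -sum w r z_i p_i h_i'
    b = H.T @ (w * r * (1.0 - p) / p * y0)              # -sum w r z_i y_i
    gamma = np.linalg.solve(A, b)
    fitted = p * (H @ gamma)                            # p_i h_i' gamma
    eta = fitted + r / p * (y0 - fitted)                # pseudo-values
    v1 = (1.0 - n / N) * np.var(eta, ddof=1) / n        # SRS form of the first term
    v2 = np.sum(w * r * (1.0 - p) / p**2 * (y0 - fitted) ** 2) / N**2
    theta = np.sum(w * r * y0 / p) / N
    return theta, v1 + v2, eta

# Hypothetical simple random sample of n = 400 from N = 10,000.
rng = np.random.default_rng(4)
n, N = 400, 10_000
x = rng.normal(1.0, 1.0, n)
H = np.column_stack([np.ones(n), x])
w = np.full(n, N / n)
r = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.3 + 0.6 * x)))).astype(float)
y = 1.0 + x + rng.normal(0.0, 1.0, n)

theta_psa, v_hat, eta = psa_with_variance_srs(y, H, r, w, N)
```

Under other designs, the first term would instead use the design-specific quadratic form in the pseudo-values.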


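Remark 2 The variance estimation of the optimal PSA estimator with augmented propensity model (18) with $(\hat\phi, \hat\lambda)$ satisfying (20) can be derived by (29) using $\hat\eta_i = \hat b_0 + \hat b_1 \hat m_i + \hat\gamma_{h2}' \hat p_i \hat h_i + r_i \hat p_i^{-1}(y_i - \hat b_0 - \hat b_1 \hat m_i - \hat\gamma_{h2}' \hat p_i \hat h_i)$, where $(\hat b_0, \hat b_1)$ and $\hat\gamma_{h2}$ are defined in (21) and (22), respectively.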


5. Simulation study 


5.1 Study one 


Two simulation studies were performed to investigate the 
properties of the proposed method. In the first simulation, 
we generated a finite population of size N = 10,000 from 
the following multivariate normal distribution: 


x, 2\4(11* OO 
% weN || LOSiek a0 
e ON 10 Ox 4 


The variable of interest $y$ was constructed as $y = 1 + x_1 + e$. We also generated response indicator variables $r_i$ independently from a Bernoulli distribution with probability


_ exp + X,,) 
fan tctexp( Qin): 
From the finite population, we used simple random sampling to select two samples of sizes $n = 100$ and $n = 400$, respectively. We used $B$ = 5,000 Monte Carlo samples in the simulation. The average response rate was about 69.6%.

To compute the propensity score, a response model of the form

$$p(x; \phi) = \frac{\exp(\phi_0 + \phi_1 x_2)}{1 + \exp(\phi_0 + \phi_1 x_2)} \tag{30}$$

was postulated and an outcome regression model of the form

$$m(x; \beta) = \beta_0 + \beta_1 x_1 \tag{31}$$

was postulated to obtain the optimal PSA estimators. Thus, both models are correctly specified. From each sample, we computed four estimators of $\theta = N^{-1}\sum_{i=1}^{N} y_i$:

1. (PSA-MLE): PSA estimator in (3) with $\hat p_i = p_i(\hat\phi)$ and $\hat\phi$ being the maximum likelihood estimator of $\phi$.

2. (PSA-CAL): PSA estimator in (3) with $\hat p_i$ satisfying the calibration constraint (15) on $(1, x_{2i})$.

3. (AUG): Augmented PSA estimator in (19). 

4. (OPT): Optimal estimator in (17). 


In the augmented PSA estimators, $\hat\phi$ was computed by the maximum likelihood method. Under model (30), the maximum likelihood estimator of $\phi = (\phi_0, \phi_1)'$ was computed by solving (6) with $h_i(\phi) = (1, x_{2i})'$. Parameter $(\beta_0, \beta_1)$ for the outcome regression model was computed using ordinary least squares, regressing $y$ on $x_1$. In addition to the point estimators, we also computed the variance estimators of the point estimators. The variance estimators of the PSA estimators were computed using the pseudo-values in (24) and the $h_i(\phi)$ corresponding to each estimator. For the augmented PSA estimators, the pseudo-values were computed by the method in Remark 2.

Table 1 presents the Monte Carlo biases, variances, and mean square errors of the four point estimators and the Monte Carlo percent relative biases and t-statistics of the variance estimators of the estimators. The percent relative bias of a variance estimator $\hat V(\hat\theta)$ is calculated as $100 \times \{V_{MC}(\hat\theta)\}^{-1}[E_{MC}\{\hat V(\hat\theta)\} - V_{MC}(\hat\theta)]$, where $E_{MC}(\cdot)$ and $V_{MC}(\cdot)$ denote the Monte Carlo expectation and the Monte Carlo variance, respectively. The t-statistic in Table 1 is the test statistic for testing the zero bias of the variance estimator. See Kim (2004).
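The percent relative bias defined above is simple to compute from Monte Carlo output. The sketch below uses made-up replicate values, a hypothetical function name, and the sample variance with divisor $B - 1$ for the Monte Carlo variance, which is an assumption since the text does not state the divisor.

```python
import numpy as np

def percent_relative_bias(theta_hats, var_hats):
    """100 * (E_MC{V_hat} - V_MC(theta_hat)) / V_MC(theta_hat)."""
    v_mc = np.var(theta_hats, ddof=1)      # Monte Carlo variance of the point estimates
    return 100.0 * (np.mean(var_hats) - v_mc) / v_mc

# Made-up replicates: four point estimates and their variance estimates.
theta_hats = np.array([1.0, 2.0, 3.0, 4.0])
var_hats = np.array([2.0, 2.0, 2.0, 2.0])
rb = percent_relative_bias(theta_hats, var_hats)
```

A positive value indicates that the variance estimator overestimates the Monte Carlo variance on average.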

Based on the simulation results in Table 1, we have the 
following conclusions. 


1. All of the PSA estimators are asymptotically unbiased because the response model (30) is correctly specified. The PSA estimator using the calibration method is slightly more efficient than the PSA estimator using the maximum likelihood estimator, because the last term of (14) is smaller for the calibration method, as the predictor for $E(Y \mid x_i) = \beta_0 + \beta_1 x_{1i}$ is better approximated by a linear function of $(1, x_{2i})$ than by a linear function of $(p_i, p_i x_{2i})$.


Table 1
2. The augmented PSA estimator is more efficient than
the direct PSA estimator (3). The augmented PSA 
estimator is constructed by using the correctly speci- 
fied regression model (31) and so it is asymptotically 
equivalent to the optimal PSA estimator in (17). 

3. Variance estimators are all approximately unbiased. 
There are some modest biases in the variance
estimators of the PSA estimators when the sample size is
small ($n = 100$).


5.2 Study two 


In the second simulation study, we further investigated
the PSA estimators with a non-linear outcome regression
model under an unequal probability sampling design. We
generated two stratified finite populations of $(x, y)$ with
four strata ($h = 1, 2, 3, 4$), where the $x_{hi}$ were independently
generated from a normal distribution $N(1, 1)$ and the $y_{hi}$ were
dichotomous variables that take values of 1 or 0 from a
Bernoulli distribution with probability $p_{1hi}$ or $p_{2hi}$. A
different probability was used for each of the two populations:


1. Population 1 (Pop1): 
Pig 114 at exp.) 5. 


2. Population 2 (Pop2): 


Pye nel erexp( 0.25 Glee Mh Sha 4.5% |) 


In addition to $x_{hi}$ and $y_{hi}$, the response indicator
variables $r_{hi}$ were generated from a Bernoulli distribution with
probability $p_{hi} = 1/\{1 + \exp(-1.5 + 0.7 x_{hi})\}$. The sizes
of the four strata were $N_1 = 1{,}000$, $N_2 = 2{,}000$, $N_3 =
3{,}000$ and $N_4 = 4{,}000$, respectively. From each of the two
finite populations, a stratified sample of size $n = 400$ was
independently generated without replacement, where a
simple random sample of size $n_h = 100$ was selected from
each stratum. We used $B = 5{,}000$ Monte Carlo samples in
this simulation. The average response rate was about 67%.
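
As a rough consistency check on this design, the Python sketch below computes the sampling weights $N_h / n_h$ and the expected response rate. It assumes the response probability is $1/\{1 + \exp(-1.5 + 0.7x)\}$ with $x \sim N(1, 1)$; that is a reading of the garbled source formula, chosen because it agrees with the reported rate of about 67%. All function names are hypothetical, and the integral is approximated with a simple trapezoid rule.

```python
import math


def response_prob(x):
    # Assumed response model: 1 / {1 + exp(-1.5 + 0.7 x)}
    return 1.0 / (1.0 + math.exp(-1.5 + 0.7 * x))


def expected_response_rate(mean=1.0, sd=1.0, half_width=8.0, steps=4000):
    """Trapezoid-rule approximation of E[response_prob(X)] for X ~ N(mean, sd^2)."""
    lo = mean - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        end_weight = 0.5 if j in (0, steps) else 1.0
        total += end_weight * response_prob(x) * density * h
    return total


# Stratum sizes from the text; n_h = 100 units are sampled in each stratum.
stratum_sizes = {1: 1000, 2: 2000, 3: 3000, 4: 4000}
n_h = 100
sampling_weights = {h: N / n_h for h, N in stratum_sizes.items()}  # 10, 20, 30, 40
```

Because $x$ has the same distribution in every stratum, the expected response rate does not depend on the stratum, and under the assumed model it comes out close to 0.67.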


Table 1
Monte Carlo bias, variance and mean square error (MSE) of the four point estimators and percent relative biases (R.B.) and
t-statistics (t-stat) of the variance estimators based on 5,000 Monte Carlo samples

                    $\hat{\theta}$                        $\hat{V}(\hat{\theta})$
  n   Method        Bias    Variance   MSE        R.B. (%)   t-stat
100   (PSA-MLE)     0.01    0.0315     0.0317      2.34      [illegible]
      (PSA-CAL)    -0.01    0.0308     0.0309      3.56      -1.70
      (AUG)         0.00    0.0252     0.0252      0.61       0.30
      (OPT)         0.00    0.0252     0.0252     -0.21      -0.10
400   (PSA-MLE)    -0.01    0.00737    0.00746     0.35       0.17
      (PSA-CAL)    -0.01    0.00724    0.00728     0.29       0.14
      (AUG)         0.00    0.00612    0.00612     0.07       0.03
      (OR)          0.00    0.00612    0.00612     0.14       0.07


To compute the propensity score, a response model of
the form

$$p_{hi} = \frac{\exp(\phi_0 + \phi_1 x_{hi})}{1 + \exp(\phi_0 + \phi_1 x_{hi})}$$

was postulated for parameter estimation. To obtain the
augmented PSA estimator, a model for the variable of
interest of the form

$$m(x_{hi}; \beta) = \frac{\exp(\beta_0 + \beta_1 x_{hi})}{1 + \exp(\beta_0 + \beta_1 x_{hi})} \qquad (32)$$

was postulated. Thus, model (32) is a true model under
(Pop1), but it is not a true model under (Pop2).
We computed four estimators: 


1. (PSA-MLE): PSA estimator in (3) using the
maximum likelihood estimator of $\phi$.

2. (PSA-CAL): PSA estimator in (3) with $\hat{p}_i$ satisfying
the calibration constraint (15) on $(1, x)$.

3. (AUG-1): Augmented PSA estimator in (19)
with $\hat{\beta}$ computed by the maximum likelihood
method.

4. (AUG-2): Augmented PSA estimator in (19)
with $\hat{\beta}$ computed by the method of Cao et al. (2009)
discussed in Remark 1.


We considered the augmented PSA estimator in (19)
with $\hat{p}_i = p_i(\hat{\phi})$, where $\hat{\phi}$ is the maximum likelihood
estimator of $\phi$. The first augmented PSA estimator (AUG-1)
used $\hat{m}_i = m(x_i; \hat{\beta})$ with $\hat{\beta}$ found by solving
$\sum_{h=1}^{4} \sum_{i \in A_h} w_{hi} r_{hi} \{y_{hi} - m(x_{hi}; \beta)\} (1, x_{hi})' = 0$, where $A_h$ is the set of
indices appearing in the sample for stratum $h$ and $w_{hi}$ is
the sampling weight of unit $i$ for stratum $h$.

Table 2 presents the simulation results for each method.
In each population, the augmented PSA estimator shows
some improvement in variance compared with the PSA
estimator using either the maximum likelihood estimator or
the calibration estimator of $\phi$. Under (Pop1), since
model (32) is true, there is essentially no difference between
the augmented PSA estimators using different methods of
estimating $\beta$. However, under (Pop2), where the assumed
outcome regression model (32) is incorrect, the augmented
PSA estimator with $\hat{\beta}$ computed by the method of Cao et al.
(2009) results in slightly better efficiency, which is consistent
with the theory in Remark 1. Variance estimates are
approximately unbiased in all cases in the simulation study.


6. Conclusion 


We have considered the problem of estimating the finite 
population mean of y under nonresponse using the propen- 
sity score method. The propensity score is computed from a 
parametric model for the response probability, and some 
asymptotic properties of PSA estimators are discussed. In 
particular, the optimal PSA estimator is derived with an 
additional assumption for the distribution of y. The propen- 
sity score for the optimal PSA estimator can be imple- 
mented by the augmented propensity model presented in 
Section 3. The resulting estimator is still consistent even 
when the assumed outcome regression model fails to hold. 

We have restricted our attention to missing-at-random 
mechanisms in which the response probability depends only 
on the always-observed x. If the response mechanism also 
depends on y, PSA estimation becomes more challenging. 
PSA estimation when missingness is not at random is 
beyond the scope of this article and will be a topic of future 
research.


Table 2
Monte Carlo bias, variance and mean square error of the four point estimators and percent relative biases (R.B.) and t-statistics of
the variance estimators, based on 5,000 Monte Carlo samples

                          $\hat{\theta}_{PSA}$                        $\hat{V}(\hat{\theta}_{PSA})$
Population   Method       Bias    Variance    MSE         R.B. (%)   t-stat
Pop1         (PSA-MLE)    0.00    0.000750    0.000762    -1.13      -0.57
             (PSA-CAL)    0.00    0.000762    0.000769    -1.45      -0.72
             (AUG-1)      0.00    0.000745    0.000757    -1.73      -0.86
             (AUG-2)      0.00    0.000745    0.000757    -1.83      -0.91
Pop2         (PSA-MLE)    0.00    0.000824    0.000826     0.29       0.14
             (PSA-CAL)    0.00    0.000829    0.000835    -0.94      -0.46
             (AUG-1)      0.00    0.000822    0.000823    -0.71      -0.35
             (AUG-2)      0.00    0.000820    0.000821    -0.61      -0.30


Assessing the accuracy of response
propensity models in longitudinal studies


Ian Plewis, Sosthenes Ketende and Lisa Calderwood


Abstract 


Non-response in longitudinal studies is addressed by assessing the accuracy of response propensity models constructed to 
discriminate between and predict different types of non-response. Particular attention is paid to summary measures derived 
from receiver operating characteristic (ROC) curves and logit rank plots. The ideas are applied to data from the UK 
Millennium Cohort Study. The results suggest that the ability to discriminate between and predict non-respondents is not 
high. Weights generated from the response propensity models lead to only small adjustments in employment transitions. 
Conclusions are drawn in terms of the potential of interventions to prevent non-response. 


Key Words: Longitudinal studies; Missing data; Weighting; Propensity scores; ROC curves; Millennium Cohort Study.


1. Introduction 


Examples of studies that have modelled the predictors of 
different kinds of, and different reasons for the non-response 
that affect longitudinal studies are plentiful, stimulated by 
being able to draw on auxiliary variables obtained from 
sample members before (and after) the occasions at which 
they are non-respondents. See, for example, Lepkowski and 
Couper (2002) for an analysis that separates refusals from 
not being located or contacted; Hawkes and Plewis (2006) 
who separate wave non-respondents from attrition cases in 
the UK National Child Development Study; and Plewis 
(2007a) and Plewis, Ketende, Joshi and Hughes (2008) who 
consider non-response in the first two waves of the UK 
Millennium Cohort Study. The focus of this paper is on how 
we can assess the accuracy of these response propensity 
models (Little and Rubin 2002). The paper is built around a 
framework that is widely used in epidemiology (Pepe 2003) 
and criminology (Copas 1999) to evaluate risk scores but 
has not, to our knowledge, been used in survey research 
before. Response propensity models can be used to con- 
struct weights intended to remove biases from estimates, to 
inform imputations, and to predict potential non-respondents 
at future waves thereby directing fieldwork resources to 
those respondents who might otherwise be lost. The accura- 
cy of response propensity models has not, however, been 
given the amount of attention it warrants in terms of their 
ability to discriminate between respondents and non- 
respondents, and to predict future non-response. Good esti- 
mates of accuracy can be used to compare the efficacy of 
different weighting methods, and to help to determine the 
allocation of scarce fieldwork resources in order to reduce 
non-response. 


The paper is organised as follows. The framework for 
assessing accuracy is set out in the next section. Section 3 
introduces the UK Millennium Cohort Study and the
methods are illustrated using data from this study in Section 
4. Section 5 concludes. 


2. Models for predicting non-response 


A typical response propensity model for a binary
outcome (e.g., Hawkes and Plewis 2006) is:

$$f(\pi_{it}) = \sum_{p=0}^{P} \beta_p x_{pi} + \sum_{q=1}^{Q} \sum_{k} \gamma_{qk} x^{*}_{qi,t-k} + \sum_{r=1}^{R} \sum_{k} \delta_{rk} z_{ri,t-k} \qquad (1)$$

where

$\pi_{it} = E(r_{it})$ is the probability of not responding for
subject $i$ at wave $t$; $r_{it} = 0$ for a response and 1 for
non-response; $f$ is an appropriate function such as
logit or probit.

$i = 1, ..., n$, where $n$ is the observed sample size at
wave one.

$t = 1, ..., T_i$, where $T_i$ is the number of waves for
which $r_{it}$ is recorded for subject $i$.

$x_{pi}$ are fixed characteristics of subject $i$ measured at
wave one, $p = 0, ..., P$; $x_{0i} = 1$ for all $i$.

$x^{*}_{qi,t-k}$ are time-varying characteristics of subject $i$,
measured at waves $t - k$, $q = 1, ..., Q$, $k = 1, 2, ...$;
often $k$ will be 1.

$z_{ri,t-k}$ are time-varying characteristics of the data
collection process, measured for subject $i$ at waves
$t - k$, $r = 1, ..., R$, $k = 0, 1, ...$; often $k$ will be 1
but can be 0 for variables such as number of contacts
before a response is obtained.

Model (1) can easily be extended to more than two
response categories such as {response, wave non-response,
attrition}. Other approaches are also possible. For example, it
is often more convenient to model the probability of not
responding just at wave $t = t^{*}$ in terms of variables
measured at earlier waves $t^{*} - k$, $k \ge 1$ or, when there is no
wave non-response so that non-response has a monotonic
rather than an arbitrary pattern, to model time to attrition as
a survival process.

The estimated response probabilities $p_{it}$ for $t = t^{*}$ are
derived from the estimated non-response probabilities in (1)
and they can be used to generate inverse probability weights
$g_{it} (= 1/p_{it})$. These are widely applied (see Section 4.2 for
an example) to adjust for biases arising from non-response
under the assumption that data are missing at random
(MAR) as defined by Little and Rubin (2002).
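
The step from estimated non-response probabilities to weights can be sketched as follows. This is a minimal illustration, not the procedure used for the MCS weights: the function names are hypothetical, the weights are meant for responding cases only, and the optional multiplication by survey weights and the rescaling to a mean of one simply mirror the description attached to Table 3.

```python
def inverse_probability_weights(nonresponse_probs, survey_weights=None):
    """Return w / p for each case, where p = 1 - estimated non-response probability."""
    if survey_weights is None:
        survey_weights = [1.0] * len(nonresponse_probs)
    weights = []
    for pi_hat, w in zip(nonresponse_probs, survey_weights):
        p_hat = 1.0 - pi_hat  # estimated response probability
        if p_hat <= 0.0:
            raise ValueError("estimated response probability must be positive")
        weights.append(w / p_hat)
    return weights


def standardise(weights):
    """Rescale weights so that their mean is one."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]
```

For example, estimated non-response probabilities of 0.2 and 0.5 give response probabilities of 0.8 and 0.5, and hence weights of 1.25 and 2.0 before standardisation.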


2.1 Assessing the accuracy of predictions 


A widely used method of assessing the accuracy of 
models like (1) is to estimate their goodness-of-fit by using 
one of several possible pseudo-$R^2$ statistics. Estimates of
pseudo-$R^2$ are not especially useful in this context, partly
because they are difficult to compare across datasets but 
also because they assess the overall fit of the model and do 
not, therefore, distinguish between the accuracy of the 
model for the respondents and non-respondents separately. 

As Pepe (2003) emphasises, there are two related compo- 
nents of accuracy: discrimination (or classification) and 
prediction. Discrimination refers to the conditional proba- 
bilities of having a propensity score (s: the linear predictor 
from (1)) above a chosen threshold (c) given that a person 
either is or is not a non-respondent. Prediction, on the other 
hand, refers to the conditional probabilities of being or
becoming a non-respondent given a propensity score above
or below the threshold.

More formally, let $D$ and $\bar{D}$ refer to the presence and
absence of the poor outcome (i.e., non-response) and define
$+$ ($s > c$) and $-$ ($s < c$) as positive and negative tests
derived from the propensity score and its threshold. Then,
for discrimination, we are interested in $P(+ \mid D)$, the true
positive fraction (TPF) or sensitivity of the test, and
$P(- \mid \bar{D})$, its specificity, equal to one minus the false
positive fraction (1 $-$ FPF). For prediction, however, we
are interested in $P(D \mid +)$, the positive predictive value
(PPV), and $P(\bar{D} \mid -)$, the negative predictive value (NPV).
If the probability of a positive test ($P(+) = \tau$) is the same
as the prevalence of the poor outcome ($P(D) = \rho$) then
inferences about discrimination and prediction are
essentially the same: sensitivity equals PPV and specificity equals
NPV. Generally, however, {TPF, FPF, $\rho$} and {PPV, NPV,
$\tau$} convey different pieces of information. TPF can be
plotted against FPF for any risk score threshold $c$. This is
the receiver operating characteristic (ROC) curve (Figure 1).
Krzanowski and Hand (2009) give a detailed discussion of
how to estimate ROC curves. The AUC, the area enclosed
by the ROC curve and the x-axis in Figure 1, is of
particular interest and can vary from 1 (perfect discrimination)
down to 0.5, the area below the diagonal (implying no
discrimination). The AUC can be interpreted as the
probability of assigning a pair of cases, one respondent and
one non-respondent, to their correct categories, bearing in
mind that guessing would correspond to a probability of 0.5.
A linear transformation of AUC ($= 2 \times$ AUC $- 1$),
sometimes referred to as a Gini coefficient and equivalent to
Somers' D rank correlation index (Harrell, Lee and Mark
1996), is commonly used as a more natural measure than
AUC because it varies from 0 to 1.
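
The quantities defined above can be illustrated with a short Python sketch. It is a hedged example rather than the estimation method used in this paper: the AUC here is the empirical pairwise (Mann-Whitney) version, whereas Table 1 reports AUC estimated under the binormal assumption, and all function names are hypothetical. Scores equal to the threshold are treated as negative tests, which is an assumption.

```python
def classification_rates(scores, nonrespondent, c):
    """TPF, FPF, PPV and NPV when a score s > c counts as a positive test."""
    tp = sum(1 for s, d in zip(scores, nonrespondent) if s > c and d)
    fp = sum(1 for s, d in zip(scores, nonrespondent) if s > c and not d)
    fn = sum(1 for s, d in zip(scores, nonrespondent) if s <= c and d)
    tn = sum(1 for s, d in zip(scores, nonrespondent) if s <= c and not d)
    # Each ratio assumes a non-empty denominator (both outcomes and both
    # test results occur at least once).
    return {
        "TPF": tp / (tp + fn),  # sensitivity, P(+ | D)
        "FPF": fp / (fp + tn),  # one minus specificity
        "PPV": tp / (tp + fp),  # P(D | +)
        "NPV": tn / (tn + fn),  # P(not D | -)
    }


def auc_pairwise(scores, nonrespondent):
    """Share of (non-respondent, respondent) pairs ranked correctly; ties count half."""
    pos = [s for s, d in zip(scores, nonrespondent) if d]
    neg = [s for s, d in zip(scores, nonrespondent) if not d]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def gini(auc):
    """Linear transformation of AUC described in the text."""
    return 2.0 * auc - 1.0
```

With scores `[0.1, 0.4, 0.35, 0.8]` and non-response indicators `[0, 0, 1, 1]`, three of the four respondent and non-respondent pairs are ordered correctly, so the AUC is 0.75 and the Gini coefficient is 0.5. Applying `gini` to the overall AUC of 0.69 reported in Section 4.1 gives 0.38.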


Figure 1 ROC curve (TPF plotted against FPF)


Copas (1999) proposes the logit rank plot as an
alternative to the ROC as a means of assessing the
predictiveness of a propensity score. If the propensity score is
derived from a logistic regression then a logit rank plot is
just a plot of the linear predictor from the model against the
logistic transformation of the proportional rank of the
propensity scores. More generally, it is a plot of logit($p_i$),
where $p_i$ is the estimated probability from any form of (1),
i.e., $p(D \mid x, x^{*}, z)$, against the logits of the proportional
ranks ($r/n$), where $r$ is the rank position of case $i$ ($i =
1, ..., n$) on the propensity score. This relation is usually
close to being linear and its slope, which can vary from
zero to one, is a measure of the predictive strength of the
propensity score. Copas argues that the slope is more
sensitive to changes in the specification of the propensity
model, and to changes in the prevalence of the outcome,
than the Gini coefficient is. A good estimate of the slope can
be obtained by calculating quantiles of the variables on the
y and x axes and then fitting a simple regression model.
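
A minimal Python sketch of the slope calculation is given below. It departs from the description above in two labelled ways: it fits the least-squares line to all cases rather than to quantiles, and it uses proportional ranks $r/(n+1)$ instead of $r/n$ so that the logit of the largest rank stays finite. The function names are hypothetical, and ties in the estimated probabilities are broken by their order in the input.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def logit_rank_slope(probs):
    """Least-squares slope of logit(p_i) on the logit of the proportional rank."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    x = [0.0] * n
    for rank, i in enumerate(order, start=1):
        x[i] = logit(rank / (n + 1))  # assumed r/(n+1) convention
    y = [logit(p) for p in probs]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx
```

If the estimated probabilities coincide with the proportional ranks, the slope is one; probabilities that are compressed towards a common value give a slope nearer zero, matching the interpretation of the slope as a measure of predictive strength.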

The extent to which propensity scores discriminate 
between respondents and non-respondents is one indicator 
of the effectiveness of any statistical adjustments for 
missingness. A lack of discrimination suggests either that 
there are important predictors absent from the propensity 
score or that a substantial part of the process that drives the 
missingness is essentially random. The extent to which 
propensity scores predict whether a case will be a non- 
respondent in subsequent waves — and what kind of non- 
respondent they will be — is an indication of whether any 
intervention to reduce non-response will be successful. 


3. The Millennium Cohort Study 


The wave one sample of the UK Millennium Cohort 
Study (MCS) includes 18,552 families born over a 12- 
month period during the years 2000 and 2001, and living in 
selected UK electoral wards at age nine months. The initial 
response rate was 72%. Areas with high proportions of
Black and Asian families, disadvantaged areas and the three
smaller UK countries are all over-represented in the sample 
which is disproportionately stratified and clustered as 
described in Plewis (2007b). The first four waves took place 
when the cohort members were (approximately) nine 
months, 3, 5 and 7 years old. At wave two, 19% of the target 
sample — which excludes child deaths and emigrants — were 
unproductive. The unproductive cases were equally divided 
between wave non-response and attrition, and between 
refusals and other non-productives (not located, not 
contacted etc.).


4. Analyses of non-response 


4.1 Accuracy of discrimination and prediction 


Plewis (2007a) and Plewis et al. (2008) show that
variables measured at wave one of the MCS that are associated
with attrition at wave two are not necessarily associated 
with wave non-response then (and vice-versa). The same is 
true for correlates of refusal and other non-productives. 
Table 1 gives the accuracy estimates from the response 
propensity models. The estimate of the Gini coefficient for 
overall non-response (0.38) is relatively low: it corresponds 
to an AUC of 0.69 which is the probability of correctly 
assigning (based on their predicted probabilities) a pair of 
cases (one respondent, one non-respondent), indicating that 
discrimination between non-respondents and respondents 
from the propensity score is not especially good. Discrimi- 
nation is slightly better for wave non-respondents than it is 
for attrition and notably better for other non-productive than 
it is for refusal. These estimates were obtained from pair- 
wise comparisons of each non-response category with being 
a respondent. A similar picture emerges when we look at the 
slopes of the logit rank plots although these bring out more 
clearly the differences in predictiveness for the different 
types of, and reasons for non-response. 


Table 1
Accuracy estimates from response propensity models, MCS wave two

                                           Non-response type (2)         Non-response reason (2)
Accuracy measure            Overall        Wave                                     Other
                            non-response   non-response   Attrition     Refusal     non-productive
AUC (1)                     0.69           0.71           0.69          0.68        0.77
Gini (1)                    0.38           0.42           0.39          0.37        0.53
Logit rank plot: slope (1)  0.45           0.51           0.44          0.40        0.63
Sample size                 18,230         16,210         16,821        16,543      16,513

(1) AUC estimated under the binormal assumption (Krzanowski and Hand 2009); 95% confidence limits for (a) AUC not more than
± 0.015, (b) Gini coefficient and logit rank plot slope not more than ± 0.03.
(2) Based on a logistic regression, allowing for the survey design using the svy commands in STATA with the sample size based on the sum
of the productive and relevant non-response category.


The correct specification of models for explaining
non-response can be difficult to achieve. New candidates for
inclusion in a model can appear after the model and the
corresponding inverse probability weights have been
estimated; others remain unknown. How much effect on
measures of accuracy might the inclusion of new variables
have? Here we examine the effects of adding three new
variables to the MCS models: (i) whether or not respondents
gave consent to having their survey records linked to health
records at wave one; (ii) a neighbourhood conditions score
derived from interviewer observations at wave two; and (iii)
whether, at wave one, the main respondent reported voting
at the last UK general election. The first two of these
variables were not available for the analyses summarised in
Table 1: refusing consent at wave $t$ might be followed by
overall refusal at wave $t + 1$, and non-response might be
greater in poorer neighbourhoods. The voting variable is an
indicator of social engagement that might be related to the 
probability of responding. As the neighbourhood conditions 
score could not be obtained for cases that were not located, 
we use this variable just in the model that compares refusals 
with productives. 

Table 2 presents the results using the same methods of 
estimation as for Table 1 with corresponding levels of 
precision. We see (from the notes) that each of the three 
variables is associated with at least one kind of non- 
response. The increase in accuracy of the AUC is more than 
would be expected by chance (p < 0.001 apart from wave 
non-response: p > 0.06) but is small except for refusal 
where the inclusion of the three new variables does make a
difference: the estimate of the Gini coefficient increases
from 0.37 to 0.41 and the slope of the logit rank plot
increases from 0.40 to 0.45 (although missing data for the
neighbourhood conditions score does reduce the sample
size).


4.2 Using weights to adjust for non-response 


Although non-response at wave two of MCS is system- 
atically related to a number of variables measured at or after 
wave one, we have seen that the models’ ability to discrimi- 
nate between and predict categories of non-response is not 
high. We now consider what effect the weights generated 
from the response propensity models have on a longitudinal 
estimate of interest. We focus on transitions between not 
working and working across the two waves. As Groves 
(2006) argues, the keys to unlocking missingness problems 
of bias are to find those variables that predict whether a 
piece of data is missing, and which of those variables that 
predict missingness are also related to the variable of 
interest. We find that all the variables that predict overall 
non-response are also related to whether or not the main 
respondent works at wave two, conditional on whether she 
was working at wave one so we might expect the applica- 
tion of non-response weights to reduce bias. The results are 
presented in Table 3 and show that, compared with just 
using the survey weights, the introduction of the non- 
response weights based on the model underpinning Table 1 
leads to small adjustments in the estimated transition 
probabilities. The consent and vote variables have no 
additional effect, however, and this is consistent with the 
marginal increases in accuracy reported in Table 2. 


Table 2
Accuracy estimates for enhanced response propensity models, MCS wave two

                                              Non-response type                  Non-response reason
Accuracy measure         Overall              Wave                                              Other
                         non-response (1)     non-response (2)   Attrition (3)   Refusal (4)    non-productive (5)
AUC                      0.70                 0.72               0.71            0.70           0.77
Gini                     0.41                 0.44               0.41            0.41           0.54
Logit rank plot: slope   0.47                 0.52               0.46            0.45           0.65
Sample size              18,148               16,177             16,745          15,656         16,443

(1) Includes consent (odds ratio (OR) = 2.1, s.e. = 0.20) and vote (OR = 1.4, s.e. = 0.08).
(2) Includes vote only (OR = 1.4, s.e. = 0.11), consent not important (t = 1.33; p > 0.18).
(3) Includes consent (OR = 2.7, s.e. = 0.26) and vote (OR = 1.4, s.e. = 0.09).
(4) Includes consent (OR = 2.6, s.e. = 0.32), vote (OR = 1.3, s.e. = 0.10) and neighbourhood score (OR = 1.02, s.e. = 0.014).
(5) Includes consent (OR = 1.6, s.e. = 0.20) and vote (OR = 1.5, s.e. = 0.11).


Table 3
Weighted employment transitions (standard errors), MCS wave two

Variable                 Survey weights only   Overall weight (1)   Overall weight (2)
No change                0.30 (0.0053)         0.30 (0.0056)        0.31 (0.0056)
Working → not working    0.34 (0.0059)         0.35 (0.0059)        0.35 (0.0060)
Not working → working    0.37 (0.0073)         0.35 (0.0073)        0.35 (0.0073)
Weight range (3)         023220 [sic]          0.19 - 4.1           0.19 - 6.3
Sample size              14,891                14,796               14,733

(1) Based on the product of the survey weights and the non-response weights using the model underpinning Table 1.
(2) Non-response weights based on a model that includes consent and vote.
(3) All weights standardised to have mean of one.


5. Discussion


Survey methodologists working with longitudinal data 
have long been exercised by the problem of non-response. 
Nearly all longitudinal studies suffer from accumulating 
non-response over time and it is common even for well- 
conducted mature studies to obtain data for less than half the 
target sample. On the other hand, a lot can be learnt about 
the correlates of different types of non-response by drawing 
on auxiliary variables from earlier waves. The main purpose 
of this paper has been to introduce a different way of 
thinking about the utility of the approaches that rely on 
general linear models both to construct inverse probability 
weights and to inform imputations. Treating the linear 
predictors from the regression models as response propen- 
sity scores and then generating ROCs enables methods for 
summarising the information in these scores to be used to 
assess the accuracy of discrimination and prediction for 
different kinds of non-response. 

The application of this approach to the Millennium 
Cohort Study has shown that, despite using a wide range of 
explanatory variables, discrimination is rather low. One 
implication of this finding is that some non-response is 
generated by circumstantial factors, none of them important 
on their own, which can reasonably be regarded as chance. 
There is some support for this hypothesis in that the 
accuracy of the models for overall non-response, wave non- 
response and other non-productive (the latter two being 
related) were little changed by the introduction of the voting 
and consent variables. On the other hand, these variables 
(and the neighbourhood conditions score) did improve the 
discrimination between productives, and attrition cases and 
refusals (which are also related). Nevertheless, discrimina- 
tion for these two categories remained lower than for the 
other types of non-response. A second possible implication 
is that the models do not discriminate well because data are 
not missing at random (NMAR) in Little and Rubin’s 
(2002) sense. In other words, it might be changes in
circumstances after the previous wave that influence non-response
at the current wave.

The implications of our findings for prediction are that it 
might be difficult to predict which cases will become non- 
respondents with a high degree of accuracy. If interventions 
to prevent non-response in longitudinal studies are to be 
effective then they need to be targeted at those cases least 
likely to respond because these cases are probably the most 
different from the respondents and therefore the major 
source of bias. This is where the ROC approach can be 
especially useful because, as Swets, Dawes and Monahan 
(2000) show, it is possible to determine the optimum 
threshold for the response propensity score based on the 
costs and benefits of intervening according to the true and
false positive rates implied by the threshold. A more
detailed assessment of these issues is beyond the scope of 
this paper but would include considering interventions to 
prevent different kinds of non-response, and the benefits of 
potential reductions in bias and variability arising from a 
sample that is both larger and closer in its characteristics to 
the target sample.


Confidence interval estimation of small area
parameters shrinking both means and variances


Sarat C. Dass, Tapabrata Maiti, Hao Ren and Samiran Sinha


Abstract 


We propose a new approach to small area estimation based on joint modelling of means and variances. The proposed model 
and methodology not only improve small area estimators but also yield “smoothed” estimators of the true sampling 
variances. Maximum likelihood estimation of model parameters is carried out using the EM algorithm because of the non-standard
form of the likelihood function. Confidence intervals of small area parameters are derived using a more general decision 
theory approach, unlike the traditional way based on minimizing the squared error loss. Numerical properties of the 
proposed method are investigated via simulation studies and compared with other competitive methods in the literature. 
Theoretical justification for the effective performance of the resulting estimators and confidence intervals is also provided. 


Key Words: EM algorithm; Empirical Bayes; Hierarchical models; Rejection sampling; Sampling variance; Small area estimation.


1. Introduction 


Small area estimation and related statistical techniques 
have become a topic of growing importance in recent years. 
The need for reliable small area estimates is felt by many 
agencies, both public and private, for making useful policy 
decisions. An example where small area techniques are used 
in practice is in the monitoring of socio-economic and 
health conditions of different age-sex-race groups where the 
patterns are observed over small geographical areas. 

It is now widely recognized that direct survey estimates 
for small areas are usually unreliable due to their typically 
large standard errors and coefficients of variation. Hence, it 
becomes necessary to obtain improved estimates with 
higher precision. Model-based approaches, either explicit or 
implicit, are elicited to connect the small areas and im- 
proved precision is achieved by “borrowing strength” from 
similar areas. The estimation technique is also known as 
shrinkage estimation since the direct survey estimates are 
shrunk towards the overall mean. The survey based direct 
estimates and sample variances are the main ingredients for 
building aggregate level small area models. The typical 
modeling strategy assumes that the sampling variances are 
known while a suitable linear regression model is assumed 
for the means. For details of these developments, we refer the
reader to Ghosh and Rao (1994), Pfeffermann (2002) and
Rao (2003). The typical area level models are subject to two 
main criticisms. First, in practice, the sampling variances are 
estimated quantities, and hence, are subject to substantial 
errors. This is because they are often based on equivalent 
sample sizes from which the direct estimates are calculated. 
Second, the assumption of known and fixed sampling
variances in typical small area models does not incorporate
the uncertainty in the variance estimation into the
overall inference strategy.

Previous attempts have been made to model only the 
sampling variances; see, for example, Maples, Bell and 
Huang (2009), Gershunskaya and Lahiri (2005), Huff, Eltinge 
and Gershunskaya (2002), Cho, Eltinge, Gershunskaya 
and Huff (2002), Valliant (1987) and Otto and Bell (1995). 
The articles Wang and Fuller (2003) and Rivest and Vandal 
(2003) extended the asymptotic mean square error (MSE) 
estimation of small area estimators when the sampling 
variances are estimated as opposed to the standard assump- 
tion of known variances. Additionally, You and Chapman 
(2006) considered the modelling of the sampling variances 
with inference using full Bayesian estimation techniques. 

The necessity of variance modelling has been felt by 
many practitioners. The latest developments in this area are 
nicely summarized in a recent article by William Bell of the
United States Census Bureau (2008). He carefully examined
the consequences of these issues in the context of MSE 
estimation of model based small area estimators. He also 
provided numerical evidence of MSE estimation for Fay- 
Herriot models (given in Equation 1) when sampling vari- 
ances are assumed to be known. The developments in the 
small area literature so far can be “loosely” viewed as (i)
smoothing the direct sampling error variances to obtain
more stable variance estimates with low bias and (ii) (partial)
accounting of the uncertainty in sampling variances by
extending the Fay-Herriot model.

Evidently, little or no attention has been given to
accounting for the sampling variances effectively while modeling
the mean, compared with the volume of research that has been
done on modeling and inferring the means. There is a lack
of systematic development in the small area literature that
includes “shrinking” both means and variances. In other
words, we would like to exploit the technique of “borrowing
strength” from other small areas to “improve” variance
estimates as we do to “improve” the small area mean estimates.
We propose a hierarchical model which uses both the direct 
survey and sampling variance estimates to infer all model 
parameters that determine the stochastic system. Our meth- 
odological goal is to develop the dual “shrinkage” esti- 
mation for both the small area means and variances, ex- 
ploiting the structure of the mean-variance joint modelling 
so that the final estimators are more precise. Numerical 
evidence shows the effectiveness of dual shrinkage on small 
area estimates of the mean in terms of the MSE criteria. 

Another major contribution of this article is to obtain
confidence intervals for small area means. The small area
literature is dominated by point estimates and their associated
standard errors; it is well known that the standard
practice of [point estimate ± q × standard error], where q
is the Z (standard normal) or t cut-off point, does not
produce intervals with accurate coverage probabilities; see
Hall and Maiti (2006) and Chatterjee, Lahiri and Li (2008)
for more details. Previous work is based on the bootstrap
procedure and has limited use due to the repeated estimation
of model parameters. We produce confidence intervals for
the means from a decision theory perspective. The construction
of these confidence intervals is easy to implement in practice.

The rest of the article is organized as follows. The pro- 
posed hierarchical model for the sample means and vari- 
ances is developed in Section 2. The estimation of model 
parameters via the EM algorithm is developed in Section 3. 
Theoretical justification for the proposed confidence interval 
and coverage properties are presented in Section 4. Sections 
5 and 6 present a simulation study and a real data example, 
respectively. Some discussion and concluding remarks are 
presented in Section 7. An alternative model formulation for 
small area as well as mathematical details are provided in 
the Appendix. 


2. Proposed model 


Suppose $n$ small areas are in consideration. For the $i$-th
small area, let $(X_i, S_i^2)$ be the pair of direct survey estimate
and sampling variance, for $i = 1, 2, \ldots, n$. Let
$Z_i = (Z_{i1}, \ldots, Z_{ip})'$ be the vector of $p$ covariates
available at the estimation stage for the $i$-th small area. We
propose the following hierarchical model:

$$ X_i \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \qquad \theta_i \sim \text{Normal}(Z_i'\beta, \tau^2), \tag{1} $$

$$ \frac{(n_i - 1) S_i^2}{\sigma_i^2} \,\Big|\, \sigma_i^2 \sim \chi^2_{n_i - 1}, \qquad \sigma_i^{-2} \sim \text{Gamma}(a, b), \tag{2} $$

independently for $i = 1, 2, \ldots, n$. In the model elicitation,
$n_i$ is the sample size for a simple random sample (SRS) from
the $i$-th area, $\beta = (\beta_1, \ldots, \beta_p)'$ is the
$p \times 1$ vector of regression coefficients, and
$B = (a, b, \beta, \tau^2)'$ is the collection of all unknown
parameters in the model. Also, Gamma$(a, b)$ is the Gamma density
function with positive shape and scale parameters $a$ and $b$,
respectively, defined as
$f(x) = \{b^a \Gamma(a)\}^{-1} e^{-x/b} x^{a-1}$ for $x > 0$, and 0
otherwise. The unknown $\sigma_i^2$ is the true variance of $X_i$
and is usually estimated by the sample variance $S_i^2$. Although
the $S_i^2$'s are assumed to follow a chi-square distribution with
$(n_i - 1)$ degrees of freedom (as a result of normality and SRS),
we note that for complex survey designs, the degrees of freedom
need to be determined carefully (e.g., Maples et al. 2009).
More importantly, the role of the sample sizes in shrinkage
estimation of $\sigma_i^2$ is as follows: for low values of $n_i$,
the estimate of $\sigma_i^{-2}$ is shrunk more towards the overall
mean ($ab$) compared to higher $n_i$ values. Thus, for variances,
sample sizes play the same role as precision in shrinkage
estimation of the small area mean estimates. We note that
You and Chapman (2006) also considered the second level
of the sampling variance modelling. However, the hyperparameters
related to the prior of $\sigma_i^2$ are not data driven; rather,
they are chosen in such a way that the prior will be vague.
Thus, their model can be viewed as the Bayesian version of
the models considered in Rivest and Vandal (2003) and
Wang and Fuller (2003). The second level modelling of
$\sigma_i^2$ in (2) can be further extended to
$\sigma_i^{-2} \sim \text{Gamma}(b, \exp(Z_i'\beta_2)/b)$ so that
$E(\sigma_i^{-2}) = \exp(Z_i'\beta_2)$ for another set of $p$
regression coefficients $\beta_2$ to accommodate covariate
information in the variance modelling.
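To make the two-level structure concrete, the following minimal Python sketch draws one area's $(\theta_i, \sigma_i^2, X_i, S_i^2)$ from models (1) and (2). The function name `simulate_area`, the argument names and the parameter values in the usage lines are illustrative assumptions, not part of the original article; only the standard library is used.

```python
import math
import random

def simulate_area(z, beta, tau2, a, b, n_i, rng):
    """Draw (theta_i, sigma2_i, X_i, S2_i) for one area under models (1)-(2).

    Hypothetical helper for illustration; z and beta are lists of length p.
    """
    # Second level of (2): precision 1/sigma_i^2 ~ Gamma(shape=a, scale=b).
    precision = rng.gammavariate(a, b)
    sigma2 = 1.0 / precision
    # Second level of (1): theta_i ~ Normal(z'beta, tau^2).
    prior_mean = sum(zj * bj for zj, bj in zip(z, beta))
    theta = rng.gauss(prior_mean, math.sqrt(tau2))
    # First level of (1): X_i | theta_i, sigma_i^2 ~ Normal(theta_i, sigma_i^2).
    x = rng.gauss(theta, math.sqrt(sigma2))
    # First level of (2): (n_i - 1) S_i^2 / sigma_i^2 ~ chi-square(n_i - 1),
    # which is a Gamma with shape (n_i - 1)/2 and scale 2.
    chi2 = rng.gammavariate((n_i - 1) / 2.0, 2.0)
    s2 = sigma2 * chi2 / (n_i - 1)
    return theta, sigma2, x, s2

# Illustrative values only (intercept-only covariate, n_i = 9).
rng = random.Random(0)
draws = [simulate_area([1.0], [10.0], 1.0, 3.0, 0.5, 9, rng) for _ in range(5)]
```

Only $(X_i, S_i^2)$ would be observed in practice; $\theta_i$ and $\sigma_i^2$ are returned here so that estimators can be checked against the truth in a simulation.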

Although our model is motivated by Hwang, Qiu and
Zhao (2009), we note that Hwang et al. (2009)
considered shrinking means and variances in the context of
microarray data, where they prescribed an important solution
by plugging a shrinkage estimator of the variance into the
mean estimator. The shrinkage estimator of the variance in
Hwang et al. (2009) is a function of $S_i^2$ only, and not of
both $X_i$ and $S_i^2$; see Remarks 2 and 3 in Section 3.4. Thus,
inference on the mean does not take into account the full
uncertainty in the variance estimation. Further, their model
does not include any covariate information. The simulation
study described subsequently indicates that our method of
estimation performs better than that of Hwang et al. (2009).

In the above model formulation, inference for the small
area mean parameter $\theta_i$ can be made based on the conditional
distribution of $\theta_i$ given all of the data
$\{(X_i, S_i^2, Z_i), i = 1, \ldots, n\}$. Under our model set-up,
the conditional distribution of $\theta_i$ is non-standard and does
not have a closed form, thus requiring numerical methods,
such as Monte Carlo and the EM algorithm, for inference;
the details are provided in the next section.


3. Inference methodology 


3.1 Estimation of unknown parameters via EM 
algorithm 


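In practice, $B = (a, b, \beta, \tau^2)'$ is unknown and has to be
estimated from the data $\{(X_i, S_i^2, Z_i), i = 1, 2, \ldots, n\}$.
Our proposal is to estimate $B$ by the marginal maximum likelihood
method: estimate $B$ by $\hat{B}$, where $\hat{B}$ maximizes the
marginal likelihood $L_M(B) = \prod_{i=1}^{n} L_{M,i}(B)$, where

$$ L_{M,i}(B) \propto \frac{\Gamma(n_i/2 + a)}{\Gamma(a)\, b^{a}} \int \frac{1}{\tau} \exp\left\{-\frac{(\theta_i - Z_i'\beta)^2}{2\tau^2}\right\} \psi_i^{-(n_i/2 + a)} \, d\theta_i, \tag{3} $$

and

$$ \psi_i = \left\{0.5 (X_i - \theta_i)^2 + 0.5 (n_i - 1) S_i^2 + \frac{1}{b}\right\}. \tag{4} $$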


The marginal likelihood $L_M$ involves integrals that cannot
be evaluated in closed form, and hence one has to resort to
numerical methods for its maximization. One such algorithm
is the EM (Expectation-Maximization) iterative procedure,
which is used when such integrals are present. The EM
algorithm involves augmenting the observed likelihood
$L_M(B)$ with missing data; in our case, the variables of
integration, $\theta_i$, $i = 1, 2, \ldots, n$, constitute this missing
information. Given $\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$,
the complete data log likelihood ($\ell_c$) can be written as

$$ \ell_c(B, \theta) = \sum_{i=1}^{n} \Big[ \log\{\Gamma(n_i/2 + a)\} - \log\{\Gamma(a)\} - a \log(b) - 0.5 \log(\tau^2) - \frac{(\theta_i - Z_i'\beta)^2}{2\tau^2} - (n_i/2 + a) \log(\psi_i) \Big], $$

where the expression of $\psi_i$ is given in Equation (4).
Starting from an initial value of $B$, $B^{(0)}$ say, the EM
algorithm iteratively performs a maximization with respect
to $B$. At the $t$-th step the objective function maximized is

$$ Q(B \mid B^{(t-1)}) = E\{\ell_c(B, \theta)\} = \sum_{i=1}^{n} \Big[ \log\{\Gamma(n_i/2 + a)\} - \log\{\Gamma(a)\} - a \log(b) - 0.5 \log(\tau^2) - \frac{E(\theta_i - Z_i'\beta)^2}{2\tau^2} - (n_i/2 + a) E\{\log(\psi_i)\} \Big]. $$


The expectation in $Q(B \mid B^{(t-1)})$ is taken with respect to
the conditional distribution of each $\theta_i$ given the data,
$\pi(\theta_i \mid X_i, S_i^2, Z_i, B^{(t-1)})$, which is

$$ \pi(\theta_i \mid X_i, S_i^2, Z_i, B) \propto \exp\{-0.5 (\theta_i - Z_i'\beta)^2/\tau^2\}\, \psi_i^{-(n_i/2 + a)}. \tag{5} $$

One challenge here is that the expectations are not available
in closed form. Thus, we resort to a Monte Carlo
method for evaluating the expressions. Suppose that $R$ iid
samples of $\theta_i$ are available, say
$\theta_{i,1}, \theta_{i,2}, \ldots, \theta_{i,R}$. Then,
each expectation of the form $E\{h(\theta_i)\}$ can be approximated
by the Monte Carlo mean

$$ E\{h(\theta_i)\} \approx \frac{1}{R} \sum_{r=1}^{R} h(\theta_{i,r}). \tag{6} $$
However, drawing random numbers from the conditional
distribution $\pi(\theta_i \mid X_i, S_i^2, Z_i, B^{(t-1)})$ is also not
straightforward since this is not a standard density. Samples are
drawn using the accept-reject procedure (Robert and Casella
2004): to obtain a sample from the target density $f$, sample $x$
from the proposal density $g$, and accept it as a
sample from $f$ with probability $f(x)/\{M g(x)\}$, where
$M = \sup_x \{f(x)/g(x)\}$. One advantage of the accept-reject
method is that the target density $f$ only needs to be
known up to a constant of proportionality, which is the case
for $\pi(\theta_i \mid X_i, S_i^2, Z_i, B^{(t-1)})$ in (5); due to the
non-standard form of the density, the normalizing constant cannot be
found in closed form. For the accept-reject algorithm, we
used the normal density
$g(\theta_i) \propto \exp\{-0.5 (\theta_i - Z_i'\beta)^2/\tau^2\}$
as the proposal density. The acceptance probability is
calculated to be
$[\{1/b + 0.5 (n_i - 1) S_i^2\}/\{1/b + 0.5 (n_i - 1) S_i^2 + 0.5 (\theta_i - X_i)^2\}]^{n_i/2 + a}$.
One can choose a better proposal distribution to increase the
acceptance probability, or a different algorithm (such as adaptive
rejection sampling or envelope accept-reject algorithms), but our
chosen proposal worked satisfactorily in the studies we conducted.
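The accept-reject step and the Monte Carlo mean (6) can be sketched as follows. This is a minimal illustration under the stated normal proposal, not the authors' implementation; `sample_theta`, `mc_mean` and the numerical values in the usage lines are hypothetical, and `zb` stands for the scalar $Z_i'\beta$.

```python
import math
import random

def sample_theta(x, s2, n_i, zb, tau2, a, b, R, rng):
    """Accept-reject draws from the density (5) with a Normal(zb, tau2) proposal."""
    # Part of psi_i in (4) that does not depend on theta_i.
    base = 1.0 / b + 0.5 * (n_i - 1) * s2
    power = n_i / 2.0 + a
    draws = []
    while len(draws) < R:
        theta = rng.gauss(zb, math.sqrt(tau2))
        # Acceptance probability stated in the text: (base / psi_i)^(n_i/2 + a).
        accept_prob = (base / (base + 0.5 * (theta - x) ** 2)) ** power
        if rng.random() < accept_prob:
            draws.append(theta)
    return draws

def mc_mean(h, draws):
    """Monte Carlo approximation (6) of E{h(theta_i)}."""
    return sum(h(t) for t in draws) / len(draws)

# Illustrative values only.
draws = sample_theta(x=11.0, s2=1.0, n_i=9, zb=10.0, tau2=1.0,
                     a=1.0, b=0.4, R=2000, rng=random.Random(1))
theta_hat = mc_mean(lambda t: t, draws)
```

Because the proposal is the second-level normal density itself, the ratio of target to proposal reduces to $\psi_i^{-(n_i/2+a)}$, which is maximized at $\theta_i = X_i$; this is why no normalizing constant is needed.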
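The maximizer of $Q(B \mid B^{(t-1)})$ at the $t$-th step can be
described explicitly. The solutions for $\beta$ and $\tau^2$ are
available in closed form as

$$ \beta^{(t)} = \Big(\sum_{i=1}^{n} Z_i Z_i'\Big)^{-1} \Big(\sum_{i=1}^{n} Z_i E(\theta_i)\Big) $$

and

$$ (\tau^2)^{(t)} = \frac{1}{n} \sum_{i=1}^{n} E(\theta_i - Z_i'\beta^{(t)})^2, $$

respectively. Also, $a^{(t)}$ and $b^{(t)}$ are obtained by solving
$S_a \equiv \partial Q(B \mid B^{(t-1)})/\partial a = 0$ and
$S_b \equiv \partial Q(B \mid B^{(t-1)})/\partial b = 0$
using the Newton-Raphson method, where

$$ S_a = \sum_{i=1}^{n} \frac{\partial}{\partial a} \log\{\Gamma(n_i/2 + a)\} - n \frac{\partial}{\partial a} \log\{\Gamma(a)\} - n \log(b) - \sum_{i=1}^{n} E\{\log(\psi_i)\} $$

and, differentiating $Q$ with respect to $b$ and using
$\partial \psi_i/\partial b = -1/b^2$,

$$ S_b = -\frac{n a}{b} + \frac{1}{b^2} \sum_{i=1}^{n} (n_i/2 + a)\, E(\psi_i^{-1}). $$

We set $B^{(t)} = (a^{(t)}, b^{(t)}, \beta^{(t)}, (\tau^2)^{(t)})$ and
proceed to the $(t+1)$-st step. This maximization procedure is
repeated until the estimate $B^{(t)}$ converges. The MLE of $B$,
denoted $\hat{B}$, is taken to be the value of $B^{(t)}$ once
convergence is established.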
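The closed-form part of the M-step can be illustrated with a short sketch. It assumes, purely to keep the example free of matrix algebra, a single scalar covariate $z_i$ (so $p = 1$), and it replaces each expectation by the Monte Carlo mean over draws of $\theta_i$; the name `m_step_beta_tau2` is hypothetical.

```python
def m_step_beta_tau2(z, theta_draws):
    """Closed-form updates of beta and tau^2 for a scalar covariate (p = 1).

    z           : list of scalar covariates z_i, one per area
    theta_draws : list of lists; theta_draws[i] holds Monte Carlo draws of theta_i
    """
    n = len(z)
    # E(theta_i) approximated by the Monte Carlo mean of the draws.
    e_theta = [sum(d) / len(d) for d in theta_draws]
    # beta = (sum z_i^2)^(-1) * sum z_i E(theta_i), the p = 1 case of the update.
    beta = sum(zi * ei for zi, ei in zip(z, e_theta)) / sum(zi * zi for zi in z)
    # tau^2 = (1/n) * sum_i E(theta_i - z_i beta)^2, again via Monte Carlo means.
    tau2 = sum(
        sum((t - zi * beta) ** 2 for t in d) / len(d)
        for zi, d in zip(z, theta_draws)
    ) / n
    return beta, tau2

# Tiny worked example with two areas and two draws each.
beta_new, tau2_new = m_step_beta_tau2([1.0, 1.0], [[9.0, 11.0], [10.0, 12.0]])
```

In the worked example the Monte Carlo means are 10 and 11, so the updated $\beta$ is 10.5 and the updated $\tau^2$ is 1.25. The updates for $a$ and $b$ have no closed form and would still require a numerical root finder such as Newton-Raphson.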


3.2 Point estimate and confidence interval for $\theta_i$


Following the standard technique, the small area estimator
of $\theta_i$ is taken to be

$$ \hat{\theta}_i = E(\theta_i \mid X_i, S_i^2, Z_i, \hat{B}), \tag{7} $$

the expectation of $\theta_i$ with respect to the conditional density
$\pi(\theta_i \mid X_i, S_i^2, Z_i, B)$ with the maximum likelihood
estimate $\hat{B}$ plugged in for $B$. The estimate $\hat{\theta}_i$
is calculated numerically using the Monte Carlo procedure (6)
described in the previous section. Subsequently, $\hat{B}$ will be
plugged in for all quantities involving the unknown $B$, although we
still keep using the notation $B$ for simplicity.

Further, we develop a confidence interval for $\theta_i$ based
on a decision theory approach. Following Joshi (1969),
Casella and Hwang (1991) and Hwang et al. (2009), consider
the loss function associated with the confidence interval $C$
given by $(k/\sigma) L(C) - I_C(\theta)$, where $k$ is a tuning
parameter independent of the model parameters, $L(C)$ is the
length of $C$ and $I_C(\theta)$ is the indicator function taking
values 1 or 0 depending on whether $\theta \in C$ or not. Note that
this loss function takes into account both the coverage
probability and the length of the interval; the positive
quantity $(k/\sigma)$ serves as the relative weight of the length
compared to the coverage probability of the confidence
interval. If $k = 0$, the length of the interval is not under
consideration, which leads to the optimal $C$ being
$(-\infty, \infty)$ with coverage probability 1. On the other hand,
if $k = \infty$, then the coverage probability is 0, leading to the
optimal $C$ being a point set. The Bayes confidence interval for
$\theta_i$ is obtained by minimizing the risk function (the expected
loss) $E\{[(k/\sigma) L(C) - I_C(\theta)] \mid X_i, S_i^2, Z_i, B\}$.
The optimal choice of $C$ is given by

$$ C_i(B) = \{\theta_i : k E(\sigma_i^{-1} \mid X_i, S_i^2, Z_i, B) < \pi(\theta_i \mid X_i, S_i^2, Z_i, B)\}. \tag{8} $$

Since $C_i(B)$ is obtained by minimizing the posterior risk,
one may like to interpret this as a Bayesian credible set.
However, following Casella and Berger (1990, page 470),
we will continue naming $C_i(B)$ a confidence interval.
From an empirical Bayes perspective also, this terminology
is more appropriate. How the tuning parameter $k$ determines
the confidence level of $C_i(B)$ will be shown explicitly
in Section 3.3.

Assuming $k$ is known for the moment, we follow the
steps below to calculate $C_i(B)$. The conditional densities of
$\sigma_i^2$ and $\theta_i$ are given by

$$ \pi(\sigma_i^2 \mid X_i, S_i^2, Z_i, B) \propto \frac{\exp\{-0.5 (X_i - Z_i'\beta)^2/(\sigma_i^2 + \tau^2)\}}{(\sigma_i^2 + \tau^2)^{1/2}} \times \frac{\exp[-\{0.5 (n_i - 1) S_i^2 + 1/b\}/\sigma_i^2]}{(\sigma_i^2)^{(n_i - 1)/2 + a + 1}} \tag{9} $$

and (5), respectively, which, as mentioned before, are not
available in closed form. Thus, similar to the case of $\theta_i$,
$E(\sigma_i^{-1} \mid X_i, S_i^2, Z_i, B)$ is computed numerically
using the Monte Carlo method by approximating the expected value
with the mean $R^{-1} \sum_{r=1}^{R} 1/\sigma_{i,r}$, where
$\sigma_{i,r}^2$, $r = 1, \ldots, R$, are $R$ samples from the
conditional density $\pi(\sigma_i^2 \mid X_i, S_i^2, Z_i, B)$. The
accept-reject procedure is used to draw random numbers from
$\pi(\sigma_i^2 \mid X_i, S_i^2, Z_i, B)$ with a proposal density
given by the inverse Gamma

$$ \frac{\exp[-\{0.5 (n_i - 1) S_i^2 + 1/b\}/\sigma_i^2]}{(\sigma_i^2)^{(n_i - 1)/2 + a + 1}} $$

and the acceptance probability

$$ \frac{\exp\{-0.5 (X_i - Z_i'\beta)^2/(\sigma_i^2 + \tau^2)\}}{(\sigma_i^2 + \tau^2)^{1/2}} \times \exp(0.5) \times |X_i - Z_i'\beta|. $$

The next step is to determine the boundary values of $C_i(B)$
by finding two $\theta_i$ values that satisfy the equation
$k E(\sigma_i^{-1} \mid X_i, S_i^2, Z_i, B) - \pi(\theta_i \mid X_i, S_i^2, Z_i, B) = 0$.
This requires the normalizing constant in (5),

$$ D_i = \int_{-\infty}^{\infty} \exp\{-0.5 (\theta_i - Z_i'\beta)^2/\tau^2\}\, \psi_i^{-(n_i/2 + a)} \, d\theta_i, $$

to be evaluated numerically. This is obtained using
Gauss-Hermite integration with 20 nodes.
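A minimal sketch of the accept-reject draw for $\sigma_i^2$ and of the Monte Carlo estimate of $E(\sigma_i^{-1} \mid \cdot)$ follows. It reads the proposal as an inverse Gamma with shape $(n_i-1)/2 + a$ and rate $0.5(n_i-1)S_i^2 + 1/b$, and the acceptance probability as the product displayed above; `sample_sigma2` and the numerical inputs are hypothetical, and `zb` again stands for $Z_i'\beta$.

```python
import math
import random

def sample_sigma2(x, s2, n_i, zb, tau2, a, b, R, rng):
    """Accept-reject draws from the conditional density (9) of sigma_i^2."""
    shape = (n_i - 1) / 2.0 + a
    rate = 0.5 * (n_i - 1) * s2 + 1.0 / b
    d = abs(x - zb)
    if d == 0.0:
        # The stated acceptance probability is identically zero when X_i = Z_i'beta,
        # so this sketch does not cover that edge case.
        raise ValueError("sketch requires X_i != Z_i'beta")
    draws = []
    while len(draws) < R:
        # If Y ~ Gamma(shape, scale = 1/rate), then 1/Y is inverse Gamma(shape, rate).
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        v = sigma2 + tau2
        # exp{-0.5 d^2 / v} / sqrt(v) * exp(0.5) * d, which never exceeds 1.
        accept_prob = math.exp(0.5 - 0.5 * d * d / v) * d / math.sqrt(v)
        if rng.random() < accept_prob:
            draws.append(sigma2)
    return draws

# Illustrative values only.
sigma2_draws = sample_sigma2(x=11.5, s2=1.0, n_i=9, zb=10.0, tau2=1.0,
                             a=1.0, b=0.4, R=500, rng=random.Random(2))
# Monte Carlo estimate of E(1/sigma_i | data).
e_inv_sigma = sum(1.0 / math.sqrt(s) for s in sigma2_draws) / len(sigma2_draws)
```

The bound behind the acceptance probability comes from maximizing $v^{-1/2}\exp\{-0.5 d^2/v\}$ over $v > 0$, which is attained at $v = d^2$; when $|X_i - Z_i'\beta|$ is very small the acceptance rate becomes low, and a tighter bound using $v \geq \tau^2$ could be substituted.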


3.3. Choice of k 


The choice of the tuning parameter $k$ in (8) is taken to
be

$$ k = k_i(B) = u_{i,0}\, \phi\left(t_{\alpha/2} \sqrt{\frac{n_i + 2a + 2}{n_i - 1}}\right), \tag{10} $$

where $\phi$ is the standard normal density, $t_{\alpha/2}$ is the
$(1 - \alpha/2)$-th percentile of the $t$ distribution with
$(n_i - 1)$ degrees of freedom, and
$u_{i,0} = \sqrt{1 + \sigma_i^2/\tau^2}$. Since $u_{i,0}$
involves $\sigma_i^2$, which is unknown, an estimated version
$\hat{u}_{i,0}$ is obtained by plugging in the maximum a posteriori
estimate

$$ \hat{\sigma}_i^2 = \arg\max_{\sigma_i^2} \pi(\sigma_i^2 \mid X_i, S_i^2, Z_i, B) \Big|_{B = \hat{B}} \tag{11} $$

in place of $\sigma_i^2$; as indicated in (11), $B$ is replaced by
$\hat{B}$. We demonstrate that the coverage probability of
$C_i(\hat{B})$ with this choice of $k$ is close to $1 - \alpha$.
Theoretical justifications are provided in Section 4.
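As a small numerical illustration, the sketch below evaluates the tuning parameter, reading (10) as $k = u_{i,0}\,\phi\big(t_{\alpha/2}\sqrt{(n_i + 2a + 2)/(n_i - 1)}\big)$ with $\phi$ the standard normal density, which is the form consistent with the derivation in Section 4. The function name `tuning_k` is hypothetical, and the $t$ cut-off is passed in as an argument because the Python standard library has no $t$ quantile function; the value 2.306 used below is the 97.5th percentile of the $t$ distribution with 8 degrees of freedom.

```python
import math
from statistics import NormalDist

def tuning_k(sigma2_hat, tau2, n_i, a, t_cut):
    """Evaluate k = u_{i,0} * phi(t_cut * sqrt((n_i + 2a + 2) / (n_i - 1)))."""
    # u_{i,0} = sqrt(1 + sigma_i^2 / tau^2), with the plug-in estimate of sigma_i^2.
    u0 = math.sqrt(1.0 + sigma2_hat / tau2)
    arg = t_cut * math.sqrt((n_i + 2.0 * a + 2.0) / (n_i - 1.0))
    return u0 * NormalDist().pdf(arg)

# Illustrative values: n_i = 9, a = 1, sigma2_hat = tau2 = 1, 95% level.
k_example = tuning_k(sigma2_hat=1.0, tau2=1.0, n_i=9, a=1.0, t_cut=2.306)
```

With these inputs the argument of $\phi$ is about 2.94 and $k$ is about 0.0075; a smaller $k$ puts less weight on interval length and therefore yields a wider interval.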


3.4 Other related methods for comparison 


Our method will be denoted as Method I. Three other 
methods to be compared are briefly described below. 


Method II: Wang and Fuller (2003) considered the Fay- 
Herriot small area estimation model given by (1). Their 
primary contribution is the construction of the mean 
squared error estimation formulae for small area esti- 
mators with estimated sampling variances. In the process, 
they had constructed two formulae denoted by MSE, 
and MSE2. We use MSE; for our comparisons, which 
was derived following the bias correction approach of 
Prasad and Rao (1990). The basic difference with our ap- 
proach is that they did not smooth the sampling vari- 
ances, only taking the uncertainty into account while 
making inference on the small area parameters. The 
method of parameter estimation, which is moment based 
for all the model parameters, is also different from ours. 


Method III: Hwang et al. (2009) considered the log-normal
and inverse Gamma models for $\sigma_i^2$ in (2) for microarray
data analysis. Their simulation study showed improved
performance of confidence intervals for small area estimators
under the log-normal model compared to the inverse
Gamma. We thus modified their log-normal model to add
covariates and to allow unequal sample sizes $n_i$ as follows:

$$ X_i \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \qquad \theta_i \sim \text{Normal}(Z_i'\beta, \tau^2); \tag{12} $$

$$ \log S_i^2 = \log(\sigma_i^2) + \delta_i, \quad \delta_i \sim N(m_i, \sigma^2_{ch,i}), \qquad \log(\sigma_i^2) \sim N(\mu_v, \tau_v^2), \tag{13} $$

independently for $i = 1, 2, \ldots, n$. Note that the model for the
means in (12) is identical to (1). The quantities $m_i$ and
$\sigma^2_{ch,i}$ are assumed to be known and are given by
$m_i = E[\log(\chi^2_{n_i - 1}/(n_i - 1))]$ and
$\sigma^2_{ch,i} = \text{Var}[\log(\chi^2_{n_i - 1}/(n_i - 1))]$.
Thus, the sample sizes $n_i$ determine the shape of the $\chi^2$
distribution via its degrees of freedom parameter. More
importantly, as mentioned earlier, the different sample sizes
account for different degrees of shrinkage for the corresponding
true variance parameter. Similar to their estimation approach,
the unknown model parameters $\mu_v$ and $\tau_v^2$ are estimated
using a moment-based approach in an empirical Bayes framework,
giving $\hat{\mu}_v$ and $\hat{\tau}_v^2$, respectively.
Note that in Hwang et al. (2009), these estimates are obtained
based on the hierarchical model for $\sigma_i^2$ in (13) only,
without regard to the modelling (1) of the mean. We refer to
Section 5 of their paper for details of the estimation of
the hyperparameters. We follow the same procedure using
only (13) to estimate $\mu_v$ and $\tau_v^2$ in the case of unequal
sample sizes.

The Bayes estimate of $\sigma_i^2$ is derived to be

$$ \hat{\sigma}^2_{i,B} = \exp\big[E\{\ln(\sigma_i^2) \mid \ln(S_i^2)\}\big] = \left(\frac{S_i^2}{\exp(m_i)}\right)^{M_{v,i}} \exp\{\mu_v (1 - M_{v,i})\}, $$

where $M_{v,i} = \tau_v^2/(\tau_v^2 + \sigma^2_{ch,i})$, with estimates
plugged in for the unknown quantities. The conditional distribution
of $\theta_i$ given $(X_i, S_i^2)$,

$$ \pi(\theta_i \mid X_i, S_i^2) = \int_0^{\infty} \pi(\theta_i \mid X_i, S_i^2, \sigma_i^2)\, \pi(\sigma_i^2 \mid X_i, S_i^2) \, d\sigma_i^2, $$

is approximated as
$\pi(\theta_i \mid X_i, S_i^2) \approx \int_0^{\infty} \pi(\theta_i \mid X_i, S_i^2, \hat{\sigma}^2_{i,B})\, \pi(\sigma_i^2 \mid X_i, S_i^2) \, d\sigma_i^2 = \pi(\theta_i \mid X_i, S_i^2, \hat{\sigma}^2_{i,B})$.
This suggests the approximate Bayes estimator of the small area
parameters given by

$$ \hat{\theta}_i = E(\theta_i \mid X_i, \hat{\sigma}^2_{i,B}) = \hat{M}_i X_i + (1 - \hat{M}_i) Z_i'\hat{\beta}, \tag{14} $$

where $\hat{M}_i = \hat{\tau}^2/(\hat{\tau}^2 + \hat{\sigma}^2_{i,B})$.
The confidence interval for $\theta_i$ is obtained as


pOrB 


Gee 0. aa ini On cll) (15) 


In Section 3 of Hwang et al. (2009), pages 269-271, the
interval in (15) is matched with the $100(1 - \alpha)\%$ $t$-interval
$[\,|\theta_i - X_i| < t S_i\,]$ to obtain the expression of $k$ as
k =k, = exp{—1/2} exp{m,/2}/(/2n). 
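The two plug-in steps of Method III, the log-normal shrinkage of the variance and the resulting point estimate (14), can be sketched as below. The constants $m_i$ and $\sigma^2_{ch,i}$ are taken as inputs because they require digamma and trigamma functions that the Python standard library does not provide; the function names and the toy inputs are illustrative assumptions.

```python
import math

def shrunk_variance(s2, m_i, sigma2_ch, mu_v, tau2_v):
    """Bayes estimate of sigma_i^2 under the log-normal model (13)."""
    # Shrinkage weight M_{v,i} = tau_v^2 / (tau_v^2 + sigma_ch_i^2).
    m_v = tau2_v / (tau2_v + sigma2_ch)
    # (S_i^2 / exp(m_i))^{M_v} * exp{mu_v * (1 - M_v)}.
    return (s2 / math.exp(m_i)) ** m_v * math.exp(mu_v * (1.0 - m_v))

def method3_point_estimate(x, zb, tau2, sigma2_b):
    """Approximate Bayes estimator (14): M_i X_i + (1 - M_i) Z_i'beta."""
    m = tau2 / (tau2 + sigma2_b)
    return m * x + (1.0 - m) * zb

# Toy inputs chosen so the arithmetic is easy to follow.
sigma2_b = shrunk_variance(s2=4.0, m_i=0.0, sigma2_ch=1.0, mu_v=0.0, tau2_v=1.0)
theta_hat3 = method3_point_estimate(x=12.0, zb=10.0, tau2=1.0, sigma2_b=1.0)
```

In the toy example the weight $M_{v,i}$ is 0.5, so the variance estimate is the geometric compromise $4^{0.5} = 2$ between the direct estimate and the prior centre, and the point estimate 11 lies halfway between $X_i = 12$ and $Z_i'\beta = 10$. Note that the variance estimate depends on $S_i^2$ only, not on $X_i$.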


Method IV: This method comprises a special case of the
Fay-Herriot model in (1), but with the estimation of model
parameters adopted from Qiu and Hwang (2007). Qiu and
Hwang (2007) considered the model

$$ X_i \mid \theta_i, \sigma^2 \sim \text{Normal}(\theta_i, \sigma^2), \qquad \theta_i \sim \text{Normal}(0, \tau^2), \tag{16} $$

independently for $i = 1, 2, \ldots, n$, for analyzing microarray
experimental data. When model parameters are known, they
proposed the point estimator $\hat{\theta}_i = \hat{M} X_i$,
$\hat{M} = (1 - (n - 2)\sigma^2/|X|^2)_+$, where $a_+$ denotes
$\max(0, a)$ for any number $a$ and
$|X| = (\sum_{i=1}^{n} X_i^2)^{1/2}$. The confidence interval
for $\theta_i$ is $\hat{\theta}_i \pm v_1(\hat{M})$, where
$v_1^2(M) = \sigma^2 M (q_1^2 - \ln M)$, with $q_1$ denoting the
standard normal cut-off point corresponding to the desired
confidence coefficient, and $v_1(0) = 0$. For the purpose of
comparison with our method, the first level of the hierarchical
model in (16) is modified as follows:

$$ X_i = Z_i'\beta + v_i + e_i, $$

where $v_i \sim \text{Normal}(0, \tau^2)$ and
$e_i \sim \text{Normal}(0, S_i^2)$ independently for
$i = 1, 2, \ldots, n$, and $S_i^2$ is treated as known.
Following Qiu and Hwang (2007), $\tau^2$ is estimated by

$$ \tilde{\tau}^2 = \frac{1}{n - p} \left[ \sum_{i=1}^{n} \hat{a}_i^2 - \sum_{i=1}^{n} S_i^2 \Big\{ 1 - Z_i' \Big( \sum_{i=1}^{n} Z_i Z_i' \Big)^{-1} Z_i \Big\} \right] $$

and $\hat{\tau}^2 = \max(\tilde{\tau}^2, 1/n)$, where
$\hat{a}_i = X_i - Z_i'\hat{\beta}$ and
$\hat{\beta} = (\sum_{i=1}^{n} Z_i Z_i')^{-1} (\sum_{i=1}^{n} Z_i X_i)$.
Next, define $M_{0i} = \hat{\tau}^2/(\hat{\tau}^2 + S_i^2)$
and $\hat{M}_i = \max(M_{0i}, M_{1i})$, where in the latter expression
$M_{0i}$ is truncated by $M_{1i} = 1 - Q_\alpha/(n_i - 2)$, and
$Q_\alpha$ is the $\alpha$-th quantile of a chi-squared distribution
with $n_i$ degrees of freedom. This $\hat{M}_i$ is used in the
formula of the confidence interval for $\theta_i$ given earlier.
When applying this method in our simulation study and real data
analysis, we modified the model to accommodate the unequal sample
sizes and covariate information mentioned earlier.
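For the known-variance model (16), the positive-part shrinkage factor and the resulting point estimates can be sketched in a few lines. The sketch reads the shrinkage factor as the familiar positive-part James-Stein form $\hat{M} = (1 - (n-2)\sigma^2/|X|^2)_+$; the function name and the toy data are illustrative assumptions, and the inputs are assumed not to be all zero.

```python
def positive_part_shrinkage(xs, sigma2):
    """Return (M_hat, estimates) with M_hat = max(0, 1 - (n - 2) * sigma2 / |X|^2)."""
    n = len(xs)
    norm2 = sum(x * x for x in xs)  # |X|^2, assumed to be nonzero
    m_hat = max(0.0, 1.0 - (n - 2) * sigma2 / norm2)
    return m_hat, [m_hat * x for x in xs]

# Strong signal: |X|^2 = 25, so M_hat = 1 - 1/25 = 0.96.
m_strong, est_strong = positive_part_shrinkage([3.0, 4.0, 0.0], sigma2=1.0)
# Weak signal: the positive part truncates the factor at zero.
m_weak, est_weak = positive_part_shrinkage([0.1, 0.1, 0.1], sigma2=1.0)
```

The second call shows why the truncation matters: without the positive part the factor would be negative and the estimates would change sign relative to the data.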


Remark 1. Hwang et al. (2009) choose $k$ by equating (15)
to the $t$ interval based on only $X_i$ for the small area
parameters $\theta_i$. Note that $X_i$ is the direct survey estimator.
Consequently, this choice of $k$ does not have any direct
control over the coverage probability of the interval
constructed under shrinkage estimation. On the other hand, our
proposed choice of $k$ has been derived to maintain nominal
coverage specifically under shrinkage estimation.


Remark 2. Note that without any hierarchical modelling
assumption, $S_i^2$ and $X_i$ are independent, as $S_i^2$ and $X_i$
are, respectively, ancillary and complete sufficient statistics
for $\theta_i$. However, under models (1) and (2) the conditional
distributions of $\sigma_i^2$ and $\theta_i$ involve both $X_i$ and
$S_i^2$, which is seen from (5) and (9).


Remark 3. In Hwang et al. (2009), the shrinkage estimator
for $\sigma_i^2$ is based only on the information in $S_i^2$, and
not on both $X_i$ and $S_i^2$. The Bayes estimator of $\sigma_i^2$
is plugged into the expression for the Bayes estimator of the small
area parameters. Thus, Hwang et al.'s small area estimator is
written as $E(\theta_i \mid X_i, \hat{\sigma}^2_{i,B})$ in (14),
where $\hat{\sigma}^2_{i,B}$ is the Bayes estimator of $\sigma_i^2$.
Due to equation (9), our shrinkage estimator of $\sigma_i^2$ depends
on $(X_i - Z_i'\beta)^2$ in addition to $S_i^2$, in contrast to
Hwang et al. (2009). We believe this could be the reason for the
improved performance of our method compared to Hwang et al. (2009).


Remark 4. As mentioned previously, the degrees of freedom
associated with the $\chi^2$ distribution for the sampling
variance need not be simply $n_i - 1$, $n_i$ being the sample size
for the $i$-th area. There is no sound theoretical result for
determining the degrees of freedom when the survey design is
complex. Wang and Fuller (2003) approximated
the $\chi^2$ with a normal distribution based on the Wilson-Hilferty
approximation. If one knows the exact sampling design, then
the simulation-based guideline of Maples et al. (2009) could
be useful. For county level estimation using the American
Community Survey, Maples et al. (2009) suggested
estimated degrees of freedom of $0.36 \times \sqrt{n_i}$.


4. Theoretical justification 


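Theoretical justification for the choice of $k$ according to
equation (10) is presented in this section. As in Hwang et al.
(2009), the conditional distribution of $\theta_i$ given $X_i$ and
$S_i^2$ can be approximated as
$\pi(\theta_i \mid X_i, S_i^2, B) \approx \pi(\theta_i \mid X_i, S_i^2, B, \hat{\sigma}_i^2)$,
where $\hat{\sigma}_i^2$ is as defined in (11). In a similar way,
approximate $E(\sigma_i^{-1} \mid X_i, S_i^2, B)$ by
$\hat{\sigma}_i^{-1}$. Based on these approximations, we have
$C_i(B) \approx \tilde{C}_i(B)$, where $\tilde{C}_i(B)$ is the
confidence interval for $\theta_i$ given by
$\tilde{C}_i(B) = \{\theta_i : \pi(\theta_i \mid X_i, S_i^2, B, \hat{\sigma}_i^2) \geq k \hat{\sigma}_i^{-1}\}$.
From (1), it follows that the conditional density
$\pi(\theta_i \mid X_i, S_i^2, B, \sigma_i^2)$ is normal with mean
$\mu_i$ and variance $v_i$, where $\mu_i$ and $v_i$ are
given by the expressions

$$ \mu_i = w_i X_i + (1 - w_i) Z_i'\beta, \qquad v_i = \left(\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}\right)^{-1}, \qquad w_i = \frac{1/\sigma_i^2}{1/\sigma_i^2 + 1/\tau^2}. \tag{17} $$

Now, choosing

$$ k = u_{i,0}\, \phi\left(t_{\alpha/2} \sqrt{\frac{n_i + 2a + 2}{n_i - 1}}\right) $$

as discussed, the confidence interval $\tilde{C}_i(B)$ becomes

$$ \tilde{C}_i(B) = \left\{\theta_i : \frac{\hat{u}_{i,0}\, |\theta_i - \hat{\mu}_i|}{\hat{\sigma}_i} \leq t_{\alpha/2} \sqrt{\frac{n_i + 2a + 2}{n_i - 1}}\right\}, \tag{18} $$

where $\hat{\mu}_i$ is the expression for $\mu_i$ in (17) with
$\sigma_i^2$ replaced by $\hat{\sigma}_i^2$. Now consider the
behaviour of $\hat{\sigma}_i^2 \equiv \hat{\sigma}_i^2(B)$ as $\tau^2$
ranges between 0 and $\infty$. When $\tau^2 \to \infty$,
$\hat{\sigma}_i^2$ converges to

$$ \hat{\sigma}_i^2(\infty) \equiv \hat{\sigma}_i^2(a, b, \beta, \infty) = \frac{(n_i - 1) S_i^2 + 2/b}{n_i + 2a + 1}. $$

Similarly, when $\tau^2 \to 0$, $\hat{\sigma}_i^2$ converges to

$$ \hat{\sigma}_i^2(0) \equiv \hat{\sigma}_i^2(a, b, \beta, 0) = \frac{(X_i - Z_i'\beta)^2 + (n_i - 1) S_i^2 + 2/b}{n_i + 2a + 2}. $$

For all intermediate values of $\tau^2$, we have
$\min\{\hat{\sigma}_i^2(0), \hat{\sigma}_i^2(\infty)\} \leq \hat{\sigma}_i^2 \leq \max\{\hat{\sigma}_i^2(0), \hat{\sigma}_i^2(\infty)\}$.
Therefore, it is sufficient to consider the following two cases:
(i) $\hat{\sigma}_i^2 \geq \hat{\sigma}_i^2(\infty)$, where it follows
that
$(n_i + 2a + 2)\hat{\sigma}_i^2 = (n_i + 2a + 1)\hat{\sigma}_i^2 + \hat{\sigma}_i^2 \geq (n_i - 1) S_i^2 + 2/b + \hat{\sigma}_i^2 \geq (n_i - 1) S_i^2$,
and (ii) $\hat{\sigma}_i^2 \geq \hat{\sigma}_i^2(0)$, where it follows
that
$(n_i + 2a + 2)\hat{\sigma}_i^2 \geq (X_i - Z_i'\beta)^2 + (n_i - 1) S_i^2 + 2/b \geq (n_i - 1) S_i^2$.
So, in both cases (i) and (ii),

$$ (n_i + 2a + 2)\, \hat{\sigma}_i^2 \geq (n_i - 1) S_i^2. \tag{19} $$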


Since $\theta_i - \mu_i \sim N(0, \sigma_i^2 \tau^2/(\sigma_i^2 + \tau^2))$
and $(n_i - 1) S_i^2/\sigma_i^2 \sim \chi^2_{n_i - 1}$, the
confidence interval

$$ D_i = \left\{\theta_i : \frac{u_{i,0}\, |\theta_i - \mu_i|}{S_i} \leq t_{\alpha/2}\right\} \tag{20} $$

has coverage probability $1 - \alpha$. Thus, if $u_{i,0}$ and $\mu_i$
are replaced by $\hat{u}_{i,0}$ and $\hat{\mu}_i$, it is expected that
the resulting confidence interval, $\tilde{D}_i$ say, will have
coverage probability of approximately $1 - \alpha$. From (19), we have

$$ P(\tilde{C}_i(B)) \geq P(\tilde{D}_i) \approx 1 - \alpha, \tag{21} $$

establishing an approximate lower bound of $1 - \alpha$ for the
confidence level of $\tilde{C}_i(B)$.

In (21), $B$ was assumed to be fixed and known. When
$B$ is unknown, we replace $B$ by its marginal maximum
likelihood estimate $\hat{B}$. Since (21) holds regardless of the
true value of $B$, substituting $\hat{B}$ for $B$ in (21) will involve
an error of order $O(1/\sqrt{N})$, where $N = \sum_{i=1}^{n} n_i$.
Compared to each single $n_i$, this pooling of the $n_i$'s is expected
to reduce the error significantly, so that $\tilde{C}_i(\hat{B})$ is
sufficiently close to $\tilde{C}_i(B)$ to satisfy the lower bound of
$1 - \alpha$ in (21).
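The step from the choice of $k$ to the interval (18) can be written out explicitly. The following worked derivation is a sketch under the normal approximation of this section, reading (10) as $k = \hat{u}_{i,0}\,\phi\big(t_{\alpha/2}\sqrt{(n_i+2a+2)/(n_i-1)}\big)$ and using the fact that the variance in (17), evaluated at $\hat{\sigma}_i^2$, equals $\hat{\sigma}_i^2\tau^2/(\hat{\sigma}_i^2+\tau^2) = \hat{\sigma}_i^2/\hat{u}_{i,0}^2$:

```latex
\begin{align*}
\pi(\theta_i \mid X_i, S_i^2, B, \hat{\sigma}_i^2)
  &= \frac{\hat{u}_{i,0}}{\hat{\sigma}_i}\,
     \phi\!\left(\frac{\hat{u}_{i,0}\,(\theta_i - \hat{\mu}_i)}{\hat{\sigma}_i}\right), \\
\pi(\theta_i \mid X_i, S_i^2, B, \hat{\sigma}_i^2) \geq k\,\hat{\sigma}_i^{-1}
  &\iff \phi\!\left(\frac{\hat{u}_{i,0}\,(\theta_i - \hat{\mu}_i)}{\hat{\sigma}_i}\right)
        \geq \frac{k}{\hat{u}_{i,0}}
        = \phi\!\left(t_{\alpha/2}\sqrt{\frac{n_i + 2a + 2}{n_i - 1}}\right) \\
  &\iff \frac{\hat{u}_{i,0}\,|\theta_i - \hat{\mu}_i|}{\hat{\sigma}_i}
        \leq t_{\alpha/2}\sqrt{\frac{n_i + 2a + 2}{n_i - 1}} .
\end{align*}
```

The last equivalence holds because $\phi$ is symmetric about zero and strictly decreasing in the absolute value of its argument. Combining the final line with (19), which gives $\hat{\sigma}_i\sqrt{(n_i+2a+2)/(n_i-1)} \geq S_i$, shows that the interval in (18) contains the $t$-type interval with $S_i$ in place of $\hat{\sigma}_i$, which is the comparison used in (21).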


5. A simulation study 


5.1 Simulation setup 


We considered a simulation setting using a subset of
parameter configurations from Wang and Fuller (2003).
Each sample in the simulation study was generated from the
following steps: first, generate observations using the
model

$$ X_{ij} = \beta + u_i + e_{ij}, $$

where $u_i \sim N(0, \tau^2)$ and $e_{ij} \sim N(0, n_i \sigma_i^2)$,
independently for $j = 1, \ldots, n_i$ and $i = 1, \ldots, n$. Then,
the random effects model for the small area mean, $X_i$, is

$$ X_i = \beta + u_i + e_i, \quad \text{independently for } i = 1, \ldots, n, $$

where $X_i \equiv \bar{X}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} X_{ij}$
and $e_i \equiv \bar{e}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} e_{ij}$.
Therefore, $X_i \sim N(\theta_i, \sigma_i^2)$, where
$\theta_i = \beta + u_i$, $\theta_i \sim N(\beta, \tau^2)$ and
$e_i \sim N(0, \sigma_i^2)$. We estimated $\sigma_i^2$ with the
unbiased estimator

$$ S_i^2 = (n_i - 1)^{-1} n_i^{-1} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2, $$

and it follows that $(n_i - 1) S_i^2/\sigma_i^2 \sim \chi^2_{n_i - 1}$,
independently for $i = 1, 2, \ldots, n$. Note that the simulation
layout has ignored the second level modelling of sampling variances
in (2). Thus, our results will indicate robustness with respect to
misspecification of the variance model.

The above steps produced the data $(X_i, S_i^2)$, $i = 1, \ldots, n$.
To simplify the simulation, we do not use any covariate
information $Z_i$. Similar to Wang and Fuller (2003), we set
all $n_i$'s equal to $m$ to ease programming efforts. However,
the true sampling variances are still chosen to be unequal:
one-third of the $\sigma_i^2$ are set to 1, another one-third are set
to 4, and the remaining one-third are set to 16. We take
$\beta = 10$ and three different choices of $\tau^2 = 0.25$, 1 and 4.
These parameter values are chosen from Qiu and Hwang
(2007). For each value of $\tau^2$, we generated 200 samples for the
two combinations $(m, n) = (9, 36)$ and $(18, 180)$.
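The data-generating steps described in this subsection can be sketched as follows. This is an illustrative reading of the setup, not the authors' code: `generate_sample` is a hypothetical name, areas are assigned to the three variance groups in order, and the unit-level errors are given variance $m\sigma_i^2$ so that the area mean has sampling variance $\sigma_i^2$.

```python
import math
import random

def generate_sample(n, m, beta, tau2, rng):
    """Generate (X_i, S_i^2, sigma_i^2) for n areas with m units each."""
    sigma2_levels = [1.0, 4.0, 16.0]
    data = []
    for i in range(n):
        # First, second and last third of the areas get sigma_i^2 = 1, 4, 16.
        sigma2 = sigma2_levels[i * 3 // n]
        u = rng.gauss(0.0, math.sqrt(tau2))
        unit_sd = math.sqrt(m * sigma2)  # Var(e_ij) = n_i * sigma_i^2 with n_i = m
        xs = [beta + u + rng.gauss(0.0, unit_sd) for _ in range(m)]
        x_bar = sum(xs) / m
        # Unbiased estimator of sigma_i^2: sum of squares / {(m - 1) * m}.
        s2 = sum((x - x_bar) ** 2 for x in xs) / ((m - 1) * m)
        data.append((x_bar, s2, sigma2))
    return data

# One replicate for the (m, n) = (9, 36) configuration with tau^2 = 1.
sample = generate_sample(n=36, m=9, beta=10.0, tau2=1.0, rng=random.Random(3))
```

Repeating the call 200 times with different seeds would mimic the replication scheme of the study; the true $\sigma_i^2$ is returned only so that results can be grouped by variance level.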

In the simulation study, we compare the proposed
method with the methods of Wang and Fuller (2003),
Hwang et al. (2009) and Qiu and Hwang (2007), which are
referred to as Methods I, II, III and IV, respectively, based
on bias, mean squared error (MSE), coverage probability
(CP) of the confidence intervals and average length of the
confidence intervals (ALCI). Table 1 contains the parameter
estimates for $a$, $b$, $\beta$ and $\tau^2$. The numerical results
indicate good performance of the maximum likelihood estimates
for the model parameters; the estimated values of $\beta$
and $\tau^2$ are close to the true values, indicating good robustness
properties with respect to distributional misspecification
in the second level of (2). Statistically significant
estimates for both $a$ and $b$ indicate that "shrunk" sampling
variances are incorporated in the proposed method. Tables
2, 3 and 4 provide numerical results averaged over areas
within each group having the same true sampling variances.
The results in the tables are based on 200 replications.

Table 1
Simulation results for the model parameters $a$, $b$, $\beta$ and $\tau^2$. Here SD represents the standard deviation over 200
replicates. We took $\beta = 10$ and $\tau^2 = 0.25$, 1 and 4

                     n = 36, m = 9          n = 180, m = 18
Parameter   τ²       Mean       SD          Mean       SD
a           0.25     1.0959     0.1540      1.0328     0.0442
            1        1.0937     0.1555      1.0325     0.0445
            4        1.0996     0.1577      1.0339     0.0450
b           0.25     0.3992     0.0983      0.4249     0.0323
            1        0.4030     0.1012      0.4253     0.0326
            4        0.3999     0.1017      0.4245     0.0328
β           0.25    10.0071     0.3618      9.9951     0.1853
            1       10.0142     0.3311      9.9970     0.1743
            4       10.0282     0.4639     10.0048     0.2254
τ²          0.25     0.2558     0.0605      0.2575     0.0097
            1        0.9418     0.3333      1.0426     0.1264
            4        3.5592     1.3316      4.0817     0.5551

Table 2
Simulation results for prediction when $\tau^2 = 0.25$. Here MSE, ALCI, CP represent the mean squared error, average confidence
interval width, and coverage probability, respectively

                    n = 36, m = 9                              n = 180, m = 18
                    Method                                     Method
           σ_i²     I         II        III       IV          I         II        III       IV
Relative   1        0.0048    0.0198    0.0272    0.0018     -0.0051   -0.0086   -0.0112   -0.0111
bias       4       -0.0033   -0.0061   -0.0145   -0.0158     -0.0130   -0.0109   -0.0065   -0.0116
           16       0.0126    0.0370    0.0369    0.0096     -0.0046   -0.0045   -0.0080   -0.0061
MSE        1        0.3066    0.3890    0.6861    0.3805      0.2258    0.2680    0.4470    0.2922
           4        0.3281    0.5430    1.3778    0.7285      0.2595    0.3000    0.5805    0.3748
           16       0.3715    0.5240    1.6749    1.9316      0.2815    0.2850    0.4856    0.6383
ALCI       1        2.1393    2.5485    4.4906    3.0528      1.9220    1.6006    3.6466    2.4811
           4        2.2632    3.9574    6.8887    5.6842      2.0 55u7  2.1524    5.2472    4.2160
           16       Mes)      4.5619    9.3335    HIMIS63     2.1046    2.3308    6.5273    7.8492
CP         1        0.9468    0.9770    0.9771    0.9708      0.9564    0.9710    0.9851    0.9631
           4        0.9468    0.9710    0.9829    0.9917      0.9555    0.9660    0.9967    0.9967
           16       0.9365    0.9660    0.9933    0.9975      0.9529    0.9610    0.9998    0.9999

Table 3
Simulation results for prediction when $\tau^2 = 1$. Here MSE, ALCI, CP represent the mean squared error, average confidence
interval width and coverage probability, respectively

                    n = 36, m = 9                              n = 180, m = 18
                    Method                                     Method
           σ_i²     I         II        III       IV          I         II        III       IV
Relative   1       -0.0152    0.0205    0.0255    0.0051     -0.0064   -0.0085   -0.0111   -0.0101
bias       4       -0.0167   -0.0164   -0.0151   -0.0219     -0.0151   -0.0121   -0.0133   -0.0164
           16      -0.0323    0.0508    0.0515    0.0216     -0.0028   -0.0017   -0.0073   -0.0039
MSE        1        0.5645    0.6330    0.7238    0.6260      0.5288    0.5430    0.5673    0.6336
           4        0.8566    1.1100    1.5396    1.0992      0.8159    0.8770    0.9415    0.8948
           16       1.0482    1.3100    2.1059    2.3156      0.9786    1.0000    1.1024    1.1878
ALCI       1        3.4550    3S 22     4.4938    Sey         3.1088    2.5094    3.6763    2.8676
           4        4.0321    5.8733    6.8984    D.7909      3.7844    4.2908    3825      4.5543
           16       4.4082    7.4286    9555      MES S5      4.1187    5.1590    6.6785    7.8937
CP         1        0.9704    0.9640    0.9762    0.9275      0.9660    0.9650    0.9786    0.8879
           4        0.9633    0.9560    0.9812    0.9808      0.9627    0.9680    0.9918    0.9740
           16       0.9533    0.9490    0.9912    0.9938      0.9613    0.9680    0.9974    0.9979


Table 4
Simulation results for prediction when $\tau^2 = 4$. Here MSE, ALCI, CP represent the mean squared error, average confidence
interval length and the coverage probability, respectively

                    n = 36, m = 9                              n = 180, m = 18
                    Method                                     Method
           σ_i²     I         II        III       IV          I         II        III       IV
Relative   1       -0.0024    0.0248    0.0229    0.0180     -0.0084   -0.0098   -0.0122   -0.0106
bias       4       -0.0343   -0.0310   -0.0210   -0.0340     -0.0110   -0.0092   -0.0174   -0.0132
           16      -0.0147    0.0702    0.0767    0.0467      0.0016    0.0024   -0.0059    0.0012
MSE        1        0.8822    0.8590    0.8579    1.0559      0.8359    0.8180    0.8541    0.8605
           4        2.0577    2.2900    2.1818    2.2422      2.0424    2.1000    2.0935    2.1130
           16       3.4516    3.7600    3.9267    3.8981      3.3153    3.3500    3.3939    3.3631
ALCI       1        4.6318    4.1936    4.5369    3.7677      4.0256    3.5346    3.9626    3.7499
           4        6.2015   10.9093    7.0376    6.4314      5.9000    9.0913    6.2217    6.1540
           16       1 hoo    18.0039    9.6718   11.3341      7.4430   14.6665    8.3908    8.7537
CP         1        0.9791    0.9670    0.9733    0.9029      0.9674    0.9570    0.9600    0.9468
           4        0.9556    0.9670    0.9725    0.9496      0.9592    0.9610    0.9633    0.9573
           16       0.9510    0.9670    0.9796    0.9858      0.9573    0.9650    0.9718    0.9776

Bias Comparisons: In most cases, the biases of the four
methods are comparable. There is no clear evidence of
significant differences between them in terms of bias.
By construction, a high sampling variance gives more weight to
the population mean, which makes the estimator closer to the
mean at the second level. On the other hand, Methods I - III
use shrinkage estimators of the sampling variances, which
would be less than the maximum of all sampling variances.
Thus, Methods I - III tend to have a little more bias. However,
due to shrinkage in the sampling variances, one may expect a
gain in the variance of the estimators which, in turn, makes
the MSE smaller. Among Methods I - III, Method I
performed better than Methods II and III, which
were quite similar to each other. The maximum gain using
Method I compared to Method II is 99%.


MSE Comparisons: In terms of the MSE, Method I
performed consistently better than the other three in all cases
except when the ratio of $\sigma_i^2$ to $\tau^2$ is the lowest:
$(\sigma_i^2 = 1)/(\tau^2 = 4) = 0.25$. In this case, the variance between
small areas (model variance) is much higher than the
variance within the areas (sampling variance). When using
our method to estimate $\theta_i$, the information "borrowed"
from other areas may misdirect the estimation: the estimated
mean of the Gamma distribution for $\sigma_i^{-2}$ from the
second level in (2) is $ab$, which equals approximately 0.44
for both the $(m, n)$ combinations of (9, 36) and (18, 180)
(the true value is $ab = 0.4$). Thus, $E(\sigma_i^{-2} | X_i, S_i^2, B)$ is
significantly smaller than 1, due to shrinkage towards the
mean, for the group which has the true value of $\sigma_i^2 = 1$.
Also, since $\sigma_i^2$ is smaller than $\tau^2$, the weight of $X_i$ should
be much greater than that of $\beta$, the overall mean. However,
due to underestimation of $\sigma_i^{-2}$ in this case, the resulting
estimator puts less weight on $X_i$, which leads to higher
MSE. This underestimation will decrease for large
sample sizes due to the consistency of Bayes estimators,
and this is indeed observed when the sample size
increases from $n = 36$ to $n = 180$ for the case $\sigma_i^2 = 1$ and
$\tau^2 = 4$. Compared to Method II, Method I shows gains in
most of the simulation cases; the maximum gain is 30%
while the only loss is 9%, for the combination $\sigma_i^2 = 1$ and
$\tau^2 = 4$ with $n = 36$ and $m = 9$. Similarly, compared to Method III,
the maximum gain of Method I is 77% and the only loss, of 11%,
is for the same parameter and sample size specifications.
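
The weighting argument above can be illustrated with a small worked calculation. As a simplifying assumption (this is not the exact Method I estimator), take the familiar area-level shrinkage weight $\tau^2/(\tau^2 + \sigma_i^2)$ on the direct estimate $X_i$, and use $1/0.44$ as a rough plug-in for the shrunken value of $\sigma_i^2$. Both choices are made here purely for illustration.

```latex
% Weight on X_i under the true values \sigma_i^2 = 1 and \tau^2 = 4
w_i^{\mathrm{true}} = \frac{\tau^2}{\tau^2 + \sigma_i^2} = \frac{4}{4 + 1} = 0.80

% Rough plug-in after shrinkage: \sigma_i^{-2} \approx 0.44, so \sigma_i^2 \approx 2.27
w_i^{\mathrm{plug\text{-}in}} \approx \frac{4}{4 + 2.27} \approx 0.64
```

Under these assumptions the weight on the direct estimate drops from about 0.80 to about 0.64, which is the direction of the effect described in the text.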


ACP Comparisons: We obtained confidence intervals with
confidence level 95%. Methods I and III do not show any
under-coverage, as expected from their optimal confidence
interval construction. Method I meets the nominal
coverage rate more frequently than any other method.
Method II shows some under-coverage, with coverage as low
as 82%.


ALCI Comparisons: Method I produced considerably
shorter confidence intervals in general. Method IV produced
lengths comparable to those of the other methods in all cases
except when $\sigma_i^2$ was high, in which case the lengths were
considerably larger. The confidence interval proposed in
Qiu and Hwang (2007) does not have good finite sample
properties, particularly for small $\tau^2$. To avoid low coverage,
they proposed to truncate $M_0 = \tau^2/(\tau^2 + \sigma_i^2)$ with a
positive number $M_1 = 1 - Q_\alpha/(\nu - 2)$ for known $\sigma_i^2$,
where $Q_\alpha$ is the $\alpha$-quantile of a chi-squared distribution
with $\nu$ degrees of freedom. When the ratio of sampling
variance to model variance, $\sigma_i^2/\tau^2$, is high, $M_1$ tends to be
higher than $M_0$. This results in nominal coverage but
with larger interval lengths. For example, in case of
$(\sigma_i^2, \tau^2) = (16, 0.25)$, the ALCI is 11.13 for Method IV
whereas the ALCI is only 2.78 and 4.56 for Methods I and II.


5.2 Robustness study 


In order to study the robustness of the proposed method
with respect to departures from the normality assumption in
the errors, we conducted the following simulation study.
Data were generated as before but with the $e_i$'s drawn from a
double-exponential (Laplace) and a uniform distribution.
The estimators from Methods II and III were little affected. This
is perhaps because these methods use moment-based
estimation of the model parameters. Method
IV resulted in larger relative bias, MSE and ALCI, and
lower coverage probability. The MSE from Method I is
always lower than that from Method II. For $\tau^2 = 0.25$ and
1, the ALCI is smaller for Method I than for Method II when
$(n = 36, m = 9)$, but the results are reversed when
$(n = 180, m = 18)$. In terms of CP, Method II has some
under-coverage (the lowest is 80%), whereas Method I did not have
any under-coverage. In order to save space we only provide
the results for parameters $a$, $b$, $\beta$ and $\tau^2$ under the Laplace
errors (see Table 5).


6. Real data analysis 


We illustrate our methodology based on a widely studied
example. The data set is from the U.S. Department of
Agriculture and was first analyzed by Battese (1988). The
data set is on corn and soybean production in 12 Iowa
counties. The sample sizes for these areas are small, ranging
from 1 to 5. We shall consider corn only to save space. The
proposed model necessarily requires sample sizes $n_i > 1$.
Therefore, modified data from You and Chapman (2006)
with $n_i \ge 2$ are used. The mean reported crop hectares for
corn ($X_i$) are the direct survey estimates and are given in
Table 6. Table 6 also gives the sample variances which are
calculated based on the original data assuming simple
random sampling. The sample standard deviation varies
widely, ranging from 5.704 to 53.999 (the coefficient of
variation varies from 0.036 to 0.423). Two covariates are
considered in Table 6: $Z_{1i}$, the mean of pixels of corn, and
$Z_{2i}$, the mean of pixels of soybean, from the LANDSAT
satellite data.


Table 5
Simulation results for the model parameters $a$ (top left panel), $b$ (top right panel), $\beta$ (bottom left panel) and $\tau^2$ (bottom right
panel) when the errors follow a Laplace distribution. Here SD represents the standard deviation over 200 replicates. We took
$\beta = 10$ and $\tau^2 = 0.25$, 1 and 4

              n = 36, m = 9             n = 180, m = 18
$\tau^2$      Mean      SD              Mean      SD

$a$
0.25          0.9624    0.1632          0.9471    0.0498
1             0.9628    0.1657          0.9476    0.0497
4             0.9689    0.1694          0.9487    0.0499

$b$
0.25          0.5793    0.1733          0.5279    0.0501
1             0.5816    [illegible]     [illegible] 0.0503
4             0.5758    0.1796          0.5263    0.0503

$\beta$
0.25          9.9736    [illegible]     9.9800    0.1773
1             9.9753    0.3709          9.9836    0.1662
4             9.9736    0.4835          9.9855    0.2161

$\tau^2$
0.25          0.2696    0.0882          0.2565    0.0074
1             1.0508    0.2501          1.0403    0.0668
4             3.9624    1.1719          4.1256    0.4201


Table 6
Corn data from You and Chapman (2006)

County        $n_i$   $X_i$         $Z_{1i}$      $Z_{2i}$      $S_i$
Franklin      3       158.623       [illegible]   188.06        5.704
Pocahontas    3       102.523       [illegible]   247.13        43.406
Winnebago     3       [illegible]   [illegible]   [illegible]   30.547
Wright        3       144.297       301.26        [illegible]   53.999
Webster       4       [illegible]   262.17        247.09        21.298
Hancock       5       109.382       314.28        198.66        15.661
Kossuth       5       110.252       298.65        204.61        12.112
Hardin        5       120.054       325.99        177.05        36.807

The estimates of $B$ are as follows: $\hat a = 1.707$, $\hat b$ =
0.00135, $\hat\tau^2 = 90.58$ and $\hat\beta = (186.0, 0.7505, 0.4100)$.
The estimated prior mean of $1/\sigma_i^2$, which is the mean of the
Gamma distribution with parameters $a$ and $b$, is $\hat a \hat b$ =
0.002295, with a square root of 0.048 (note that 1 / 0.048 =
20.85, consistent with the range of the sample standard
deviations between 5.704 and 53.999). The small area
estimates and their confidence intervals are summarized in
Table 7 and Figure 1. Point estimates of all 4 methods are
comparable: the summary measures comprising the
mean, median, and range of the small area parameter
estimates for Methods I, II, III, and IV are (121.9, 124.1,
122.2, 122.4), (125.2, 120.4, 115.0, 114.5) and (23.1, 53.0,
58.4, 56.6), respectively. The distribution of $\hat\theta_i$ (plotted
by considering all the $i$'s) is summarized in Figure
2, which shows that there is a significant difference in their
variability. Method I has the lowest variability and is
superior in this sense. Further, smoothing the sampling
variances has a strong implication for measuring uncertainty and
hence for interval estimation. The proposed method has
the shortest confidence intervals on average compared to
all other methods. Methods II and III provide intervals with
negative lower limits. This seems unrealistic because the
direct average of area under corn is positive and large for all
the 12 counties (the crude confidence intervals $(X_i \pm
t_{0.025} S_i)$ do not contain zero for any of the areas either).
Note that Method II does not have any theoretical support
for its confidence intervals. Methods II and III produce
wider confidence intervals when the sampling variance is
high. For example, the sample size for both Franklin county
and Pocahontas county is three, but the sampling standard
deviations are 5.704 and 43.406. Although the confidence
intervals under Method I are comparable, they are wide apart
for Methods II and III. This is because, although these
methods consider the uncertainty in the sampling variance
estimates, the smoothing did not use the information from
the direct survey estimates, so that the underlying sampling
variance estimates remain highly variable (due to the small
sample sizes). In effect, the variance of the variance estimator
(of the point estimates) is bigger than that in Method
I. This is further confirmed by the fact that the intuitive
standard deviations of the "smoothed" small area estimates
(one fourth of the interval length) are smaller and less variable
under Method I compared to the others. Another noticeable
aspect of our method is that the interval widths are similar
for counties with the same sample size. This could be an
indication of obtaining equi-efficient estimators for equivalent
sample sizes.


Model selection: For choosing the best fitting model, we
used the Bayesian Information Criterion (BIC), which takes
into account both the likelihood and the complexity of
the fitted models. We calculated BICs for the models used
in Methods I and III (Hwang et al. 2009). These two models
have the same number of parameters and differ only in
the way the parameters are estimated. The model BIC
for Method I is 210.025 and that for Method III is 227.372.
This indicates superiority of our model. We could not
compute the BIC for Wang and Fuller (2003) since they did
not use any explicit likelihood.


Table 7
Results of the corn data analysis. Here CI and LCI represent the confidence interval and the length of the confidence interval,
respectively

County        $\hat\theta_i$   CI                      LCI

I: Proposed method
Franklin      131.8106         104.085, 159.372        55.287
Pocahontas    108.7305         80.900, 136.436         55.536
Winnebago     109.0559         81.430, 136.646         55.216
Wright        131.6113         103.736, 159.564        55.828
Webster       113.1484         92.805, 133.348         40.543
Hancock       129.4279         111.781, 147.193        35.412
Kossuth       121.0071         103.451, 138.626        35.175
Hardin        130.2520         112.373, 148.114        35.741

II: Wang and Fuller (2003)
Franklin      155.4338         124.151, 193.094        68.943
Pocahontas    102.3682         -38.973, 244.019        282.993
Winnebago     115.9093         -53.768, 279.314        333.083
Wright        131.0674         38.330, 280.263         241.932
Webster       109.4795         32.514, 202.675         170.161
Hancock       124.1028         56.750, 162.013         105.262
Kossuth       116.7147         68.049, 152.454         84.405
Hardin        137.7983         51.734, 188.373         136.638

III: Hwang et al. (2009)
Franklin      158.4677         128.564, 188.370        59.805
Pocahontas    100.1276         -44.039, 244.295        288.334
Winnebago     114.1473         0.065, 228.228          228.163
Wright        140.3717         -24.119, 304.862        328.982
Webster       115.7865         [illegible]             130.978
Hancock       111.3087         66.213, 156.403         90.189
Kossuth       110.9585         74.366, 147.550         73.184
Hardin        126.6093         40.040, 213.178         173.137

IV: Qiu and Hwang (2007)
Franklin      157.7383         146.999, 168.477        21.478
Pocahontas    101.1661         19.444, 182.887         163.442
Winnebago     113.7746         56.263, 171.286         115.022
Wright        143.2244         41.559, 244.889         203.330
Webster       115.2224         75.124, 155.320         80.196
Hancock       113.1766         83.691, 142.661         58.970
Kossuth       111.3239         89.320, 139.827         45.607
Hardin        123.9049         54.607, 193.202         138.594

Figure 1 Corn hectares estimation. The horizontal line for each county displays the confidence interval of $\theta_i$, with $\hat\theta_i$ marked by
the circle, for (I) Proposed method, (II) Wang and Fuller (2003), (III) Hwang et al. (2009) and (IV) Qiu and Hwang (2007)

Figure 2 Boxplot of estimates of corn hectares for each county.
(I) to (IV) are the 4 methods corresponding to
Figure 1

7. Conclusion


In this paper, joint area level modeling of means and
variances is developed for small area estimation. The
resulting small area estimators are shown to be more efficient
than the traditional estimators obtained using Fay-Herriot
models, which only shrink the means. Although our model is
the same as the one considered in Hwang et al. (2009), our method
of estimation differs in two ways: in the determination
of the tuning parameter $k$, and in the use of $\pi(\sigma_i^2 | X_i, S_i^2, Z_i)$
(which depends additionally on $X_i$), instead of $\pi(\sigma_i^2 | S_i^2,
Z_i)$, for constructing the conditional distribution of the
small area parameters $\theta_i$. We demonstrated robustness
properties of the model when the assumption that the $\sigma_i^2$ arise
from an inverse Gamma distribution is violated. The
borrowing of $X_i$ information when estimating $\sigma_i^2$, as well as
the robustness with respect to prior elicitation, demonstrate
the superiority of our proposed method. The parameter
values chosen in the simulation study are different from those in
the real data analysis. The real data analysis given here is
merely for illustration purposes. Our main aim was to
develop the methodology of mean-variance modeling and
contrast it with some closely related methods to show its
effectiveness. For this reason, we chose the parameter settings
in the simulation to be the same as in the well-known small
area estimation article of Wang and Fuller (2003).

Obtaining improved sampling variance estimators is a
byproduct of the proposed approach. We have provided an
innovative estimation technique which is theoretically
justified and user friendly. Computationally, the method is much
simpler than other competitive methods such as
Bayesian MCMC procedures or bootstrap resampling
methods. We need to sample from the posterior distribution only
once during the model parameter estimation, and the
sampled values can be used subsequently for all other purposes.
The software is available from the authors upon request.

Appendix
A. Derivation of the conditional distributions


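From Equations (1) and (2), the conditional joint distribution
of $X_i$, $S_i^2$, $\theta_i$ and $\sigma_i^2$ given $B = (a, b, \beta, \tau^2)$ is

Therefore the conditional distributions of $\sigma_i^2$ and $\theta_i$ given
the data and $B$ are

$$\pi(\sigma_i^2 | X_i, S_i^2, Z_i, B) \propto \int \pi(X_i, S_i^2, \theta_i, \sigma_i^2 | Z_i, B)\, d\theta_i$$

$$\propto \frac{(\sigma_i^2)^{-\{(n_i - 1)/2 + a + 1\}}}{(\sigma_i^2 + \tau^2)^{1/2}} \exp\left[ -\frac{(X_i - Z_i^T\beta)^2}{2(\sigma_i^2 + \tau^2)} - \left\{ \frac{(n_i - 1) S_i^2}{2} + \frac{1}{b} \right\} \frac{1}{\sigma_i^2} \right]$$

and

$$\pi(\theta_i | X_i, S_i^2, Z_i, B) \propto \int \pi(X_i, S_i^2, \theta_i, \sigma_i^2 | Z_i, B)\, d\sigma_i^2$$

$$\propto \exp\left\{ -\frac{(\theta_i - Z_i^T\beta)^2}{2\tau^2} \right\} \psi_i^{-(n_i/2 + a)},$$

where $\psi_i$ is defined in Equation (4).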


B. Details of the EM algorithm 


The maximization of $Q(B | B^{(t-1)})$ is done by setting the
partial derivatives with respect to $B$ to be zero, that is,

$$\frac{\partial Q(B | B^{(t-1)})}{\partial B} = 0. \quad (B.1)$$

From the expression of $Q(B | B^{(t-1)})$ in the text, we give
explicit expressions for the partial derivatives with respect to
each component of $B$. The partial derivative corresponding
to $\beta$ is


0O(B |B“) 


[2 “ =F el (0, si B)’ |v? 
z T 25 


- Jexp Madde at By ) dO 


eet] 


where the expectation is with respect to the conditional
distribution of $\theta_i$, $\pi(\theta_i | X_i, S_i^2, B)$. The expression of the
partial derivative corresponding to $\tau^2$ is:


Similarly for $a$ and $b$, we get the solutions by setting
$S_a = 0$ and $S_b = 0$, where $S_a$ and $S_b$ are, respectively,
the partial derivatives of $Q(B | B^{(t-1)})$ with respect to $a$
and $b$, with expressions given in the main text. These
equations are solved using the Newton-Raphson method,
which requires the matrix of second derivatives with respect
to $a$ and $b$. These are given by the following expressions:


Sa 5 tog" (r(% + “\} 


— log" {T'(a)} + cn) 


2 
Bes (Bea) aNatiiess 
Wi 2 b Vi 


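with $S_{ab} = S_{ba}$. At the $u$th step, the updates of $a$ and $b$ are
given by

$$\begin{pmatrix} a^{(u)} \\ b^{(u)} \end{pmatrix} = \begin{pmatrix} a^{(u-1)} \\ b^{(u-1)} \end{pmatrix} - \begin{pmatrix} S_{aa}^{(u-1)} & S_{ab}^{(u-1)} \\ S_{ba}^{(u-1)} & S_{bb}^{(u-1)} \end{pmatrix}^{-1} \begin{pmatrix} S_a^{(u-1)} \\ S_b^{(u-1)} \end{pmatrix} \quad (B.3)$$

where the superscript $(u-1)$ on $S_a$, $S_b$, $S_{aa}$, $S_{ab}$ and
$S_{bb}$ denotes these quantities evaluated at the values of $a$ and
$b$ at the $(u-1)$th iteration. Once the Newton-Raphson
procedure converges, the values of $a$ and $b$ at the $t$th step
of the EM algorithm are set to the converged values, $a^{(t)} = a$ and $b^{(t)} = b$.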
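
The two-parameter Newton-Raphson update can be sketched in a few lines of Python. This is only an illustration of the update rule: the objective below is a hypothetical concave toy function with maximum at $(a, b) = (1, 1)$, not the $Q$ function of the EM algorithm, and the names `newton_raphson_2d`, `score` and `hessian` are assumptions introduced for the example.

```python
def newton_raphson_2d(score, hessian, a0, b0, tol=1e-10, max_iter=100):
    """Iterate (a, b) <- (a, b) - H^{-1} S until the step is negligible."""
    a, b = a0, b0
    for _ in range(max_iter):
        s_a, s_b = score(a, b)              # first derivatives S_a, S_b
        s_aa, s_ab, s_bb = hessian(a, b)    # second derivatives (S_ab = S_ba)
        det = s_aa * s_bb - s_ab * s_ab
        # Explicit 2x2 inverse applied to the score vector
        step_a = (s_bb * s_a - s_ab * s_b) / det
        step_b = (s_aa * s_b - s_ab * s_a) / det
        a, b = a - step_a, b - step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b


# Toy objective (an assumption for illustration only):
# f(a, b) = log a + log b - a - b - 0.5 * (a - b)^2, maximized at a = b = 1.
def score(a, b):
    return (1.0 / a - 1.0 - (a - b), 1.0 / b - 1.0 + (a - b))


def hessian(a, b):
    return (-1.0 / a**2 - 1.0, 1.0, -1.0 / b**2 - 1.0)


a_hat, b_hat = newton_raphson_2d(score, hessian, a0=0.5, b0=2.0)
```

In an EM setting the same routine would be called inside each M-step, with `score` and `hessian` evaluated from the current conditional expectations.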


C. An alternative small area model formulation 


It is possible to reduce the width of the confidence
interval $C(B)$ based on an alternative hierarchical model
for small area estimation which has some mathematical
elegance. The constant term $n_i + 2a + 2$ in (19) becomes
$n_i + 2a$ in this alternative model formulation. The model is
given by

$$X_i | \theta_i, \sigma_i^2 \sim N(\theta_i, \sigma_i^2), \quad (C.1)$$
$$\theta_i | \sigma_i^2 \sim N(Z_i^T\beta, \lambda\sigma_i^2), \quad (C.2)$$
$$\frac{(n_i - 1) S_i^2}{\sigma_i^2} \,\Big|\, \sigma_i^2 \sim \chi^2_{n_i - 1}, \quad (C.3)$$
$$\sigma_i^2 \sim \text{Inverse-Gamma}(a, b), \quad (C.4)$$

independently for $i = 1, 2, \ldots, n$. Note that in the above
formulation, it is assumed that the conditional variance of
$\theta_i$ is proportional to $\sigma_i^2$ whereas the marginal variance is
constant (by integrating out $\sigma_i^2$ using (C.4)). In (1) and (2),
the variance of $\theta_i$ is a constant, $\tau^2$, independent of $\sigma_i^2$,
and there is no conditional structure for $\theta_i$ depending on
$\sigma_i^2$. The set of all unknown parameters in the current
hierarchical model is $B = (a, b, \beta, \lambda)$. The inference procedure
for this model is given subsequently. The model essentially
assumes that the true small area effects are not identically
distributed even after eliminating the known variations.


C.1 Inference methodology 


By re-parameterizing the variance as in (C.2), some
analytical simplifications are obtained in the derivation of
the posteriors of $\theta_i$ and $\sigma_i^2$ given $X_i$, $S_i^2$ and $B$. We have


t(o, | X,, 57, B) 


5 2 CES: 


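where $IG(a, b)$ stands for the inverse Gamma distribution
with shape and scale parameters $a$ and $b$, respectively.
Given $B$ and $\sigma_i^2$, the conditional distribution of $\theta_i$ is

$$\pi(\theta_i | X_i, \sigma_i^2, B) = \text{Normal}\left( \frac{\lambda X_i + Z_i^T\beta}{1 + \lambda},\; \frac{\lambda\sigma_i^2}{1 + \lambda} \right).$$

Integrating out $\sigma_i^2$, one obtains the conditional distribution
of $\theta_i$ given $X_i$, $S_i^2$ and $B$,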
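$$\pi(\theta_i | X_i, S_i^2, B) = \int_0^\infty \pi(\theta_i | X_i, \sigma_i^2, B)\, \pi(\sigma_i^2 | X_i, S_i^2, B)\, d\sigma_i^2 \propto \left[ 1 + \frac{1 + \lambda}{\lambda\delta^2} (\theta_i - \mu_i)^2 \right]^{-(n_i + 2a + 1)/2}, \quad (C.5)$$

where $\mu_i = (\lambda X_i + Z_i^T\beta)/(1 + \lambda)$ and $\delta^2 = (n_i - 1) S_i^2 + (X_i - Z_i^T\beta)^2/(1 + \lambda) + 2/b$. We
can rewrite (C.5) as

$$\frac{\Gamma((n_i + 1)/2 + a)\, \sqrt{1 + \lambda}}{\delta^*\, \Gamma(n_i/2 + a)\, \sqrt{(n_i + 2a)\pi\lambda}} \left[ 1 + \frac{(1 + \lambda)(\theta_i - \mu_i)^2}{\lambda (n_i + 2a)\, \delta^{*2}} \right]^{-(n_i + 2a + 1)/2}$$

which can be seen to be a scaled t-distribution with $n_i + 2a$
degrees of freedom and scale parameter $\delta^* \sqrt{\lambda/(1 + \lambda)}$
with $\delta^{*2} = \delta^2/(n_i + 2a)$. Hence,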
T((n, + 1)/2 + a(S? /2) rrr 
T'(n,/2 + a)(8? /2) 
_ 1@,+ D2 +4) 2 
[(n, /2 + a) 


n(0,|X,,S;, B) = 


(Gy Ses Sey Be 


5 n,+ 2a 


Survey Methodology, December 2012 


In this context, choosing 


2 =( Lal) 2 
e 2 teyeeitiele2? pth athens 
j — | J r [27 
the confidence interval in (8) simplifies to 
ca al le 
Gi" Vee iae:  e (CH 
V1+a n.—1 


Using similar arguments as before and noting that
$(n_i + 2a)\delta^{*2} > (n_i - 1) S_i^2$, we have $P\{C_i(B)\} \ge P(D_i) =
1 - \alpha$, where $D_i$ is the confidence interval in (20). When
$B$ is unknown, we replace $B$ by its marginal maximum
likelihood estimate $\hat B$. It is expected that the pooling
technique will result in an error small enough so that
$P\{C_i(\hat B)\} \approx P\{C_i(B)\} \ge 1 - \alpha$.

Condition indexes and variance decompositions for diagnosing
collinearity in linear model analysis of survey data


Dan Liao and Richard Valliant¹


Abstract 


Collinearities among explanatory variables in linear regression models affect estimates from survey data just as they do in 
non-survey data. Undesirable effects are unnecessarily inflated standard errors, spuriously low or high t-statistics, and
parameter estimates with illogical signs. The available collinearity diagnostics are not generally appropriate for survey data 
because the variance estimators they incorporate do not properly account for stratification, clustering, and survey weights. In 
this article, we derive condition indexes and variance decompositions to diagnose collinearity problems in complex survey 
data. The adapted diagnostics are illustrated with data based on a survey of health characteristics. 


Key Words: Diagnostics for survey data; Multicollinearity; Singular value decomposition; Variance inflation. 


1. Introduction 


When predictor variables in a regression model are 
correlated with each other, this condition is referred to as 
collinearity. Undesirable side effects of collinearity are 
unnecessarily high standard errors, spuriously low or high 
t-statistics, and parameter estimates with illogical signs or 
ones that are overly sensitive to small changes in data 
values. In experimental design, it may be possible to create 
situations where the explanatory variables are orthogonal to 
each other, but this is not true with observational data. 
Belsley (1991) noted that: “... in nonexperimental sciences, 
..., collinearity is a natural law in the data set resulting from 
the uncontrollable operations of the data-generating mecha- 
nism and is simply a painful and unavoidable fact of life.” In 
many surveys, variables that are substantially correlated are 
collected for analysis. Few analysts of survey data have 
escaped the problem of collinearity in regression estimation, 
and the presence of this problem encumbers precise sta- 
tistical explanation of the relationships between predictors 
and responses. 

Although many regression diagnostics have been de- 
veloped for non-survey data, there are considerably fewer 
for survey data. The few articles that are available concen- 
trate on identifying influential points and influential groups 
with abnormal data values or survey weights. Elliot (2007) 
developed Bayesian methods for weight trimming of linear 
and generalized linear regression estimators in unequal 
probability-of-inclusion designs. Li (2007a, b) and Li and 
Valliant (2009, 2011) extended a series of traditional diag- 
nostic techniques to regression on complex survey data. 
Their papers cover residuals and leverages, several diag- 
nostics based on case-deletion (DFBETA, DFBETAS, 
DFFIT, DFFITS, and Cook’s Distance), and the forward 
search approach. Although an extensive literature in applied
statistics provides valuable suggestions and guidelines for
data analysts to diagnose the presence of collinearity (e.g.,
Belsley, Kuh and Welsch 1980; Belsley 1991; Farrar and
Glauber 1967; Fox 1986; Theil 1971), almost none of this
research touches upon diagnostics for collinearity when
fitting models with survey data. One prior, survey-related
paper on collinearity problems is Liao and Valliant (2012),
which adapted variance inflation factors for linear models
fitted with survey data.

Suppose the underlying structural model in the
superpopulation is $Y = X\beta + e$. The matrix $X$ is an $n \times p$
matrix of predictors with $n$ being the sample size; $\beta$ is a
$p \times 1$ vector of parameters. The error terms in the model
have a general variance structure $e \sim (0, \sigma^2 R)$, where $\sigma^2$
is an unknown constant and $R$ is an unknown $n \times n$
covariance matrix. Define $W$ to be the diagonal matrix of
survey weights. We assume throughout that the survey
weights are constructed in such a way that they can be used
for estimating finite population totals. The survey weighted
least squares (SWLS) estimator is

$$\hat\beta_{SW} = (X'WX)^{-1} X'WY = A^{-1} X'WY,$$

assuming $A = X'WX$ is invertible. Fuller (2002)
describes the properties of this estimator. The estimator
$\hat\beta_{SW}$ is model unbiased for $\beta$ under the model $Y =
X\beta + e$ regardless of whether $Var_M(e) = \sigma^2 R$ is
specified correctly or not, and is approximately design-unbiased
for the census parameter $B_U = (X_U'X_U)^{-1} X_U'Y_U$ in the
finite population $U$ of $N$ units. The finite population
values of the response vector and matrix of predictors are
$Y_U = (Y_1, \ldots, Y_N)'$ and $X_U = (X_1, \ldots, X_p)$, with $X_k$
being the $N \times 1$ vector of values for covariate $k$.
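
The SWLS estimator above can be computed directly from its definition. The following is a minimal sketch using NumPy; the function name `swls`, the simulated data and the uniform weights are all assumptions made for illustration and are not taken from the article.

```python
import numpy as np


def swls(X, y, w):
    """Survey-weighted least squares: solve (X'WX) beta = X'Wy."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w              # X'W without forming the n x n diagonal matrix W
    A = XtW @ X                # A = X'WX
    return np.linalg.solve(A, XtW @ y)


# Simulated example (hypothetical data and weights)
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)
w = rng.uniform(1.0, 10.0, size=n)

beta_sw = swls(X, y, w)
```

Solving the linear system rather than inverting $A$ explicitly is a numerical convenience; it yields the same estimator whenever $A$ is invertible.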

The remainder of the paper is organized as follows. 
Section 2 reviews results on condition numbers and variance 


1. Dan Liao, RTI International, 701 13th Street, N.W., Suite 750, Washington DC, 20005. E-mail: dliao@rti.org; Richard Valliant, University of Michigan
and University of Maryland, Joint Program in Survey Methodology, 1218 Lefrak Hall, College Park, MD, 20742.
decompositions for ordinary least squares. These are extended
to be appropriate for survey estimation in section 3.
The fourth section gives some numerical illustrations of the 
techniques. Section 5 is a conclusion. In most derivations, 
we use model-based calculations since the forms of the 
model-variances are useful for understanding the effects of 
collinearity. However, when presenting variance decompo- 
sitions, we use estimators that have both model- and design- 
based justifications. 


2. Condition indexes and variance decompositions 
in ordinary least squares estimation 


In this section we briefly review techniques for diag- 
nosing collinearity in ordinary least squares (OLS) esti- 
mation based on condition indexes and variance decompo- 
sitions. These methods will be extended in section 3 to 
cover complex survey data. 


2.1 Eigenvalues and eigenvectors of X’X 


When there is an exact (perfect) collinear relation in the
$n \times p$ data matrix $X$, we can find a set of values, $v =
(v_1, \ldots, v_p)'$, not all zero, such that

$$v_1 x_1 + \cdots + v_p x_p = 0, \quad \text{or} \quad Xv = 0. \quad (1)$$


However, in practice, when there exists no exact collinearity
but some near dependencies in the data matrix, it may be
possible to find one or more non-zero vectors $v$ such that
$Xv = a$ with $a \ne 0$ but close to 0. Alternatively, we
might say that a near dependency exists if the length of
vector $a$, $\|a\|$, is small. To normalize the problem of
finding the set of $v$'s that makes $\|a\|$ small, we consider
only $v$ with unit length, that is, with $\|v\| = 1$. Belsley
(1991) discusses the connection of the eigenvalues and 
eigenvectors of X’X with the normalized vector v and 
|| a ||. The minimum length || a || is simply the positive 
square root of the smallest eigenvalue of X’X. The v that 
produces the a with minimum length must be the 
eigenvector of X’X_ that corresponds to the smallest 
eigenvalue. As discussed in the next section, the eigenvalues 
and eigenvectors of X are related to those of X’X and 
have some advantages when examining collinearity. 
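
The link between near dependencies and the smallest eigenvalue of $X'X$ can be checked numerically. The sketch below builds a hypothetical matrix with a near dependency by construction (the third column is almost the sum of the first two) and verifies that the unit vector minimizing $\|Xv\|$ is the eigenvector for the smallest eigenvalue. All data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=1e-3, size=n)   # near dependency by construction
X = np.column_stack([x1, x2, x3])

# eigh returns eigenvalues of the symmetric matrix X'X in ascending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v = eigvecs[:, 0]                # unit eigenvector for the smallest eigenvalue
a = X @ v
min_length = np.linalg.norm(a)   # should equal sqrt(smallest eigenvalue)

# Any other unit vector gives a length at least as large
u = rng.normal(size=3)
u /= np.linalg.norm(u)
other_length = np.linalg.norm(X @ u)
```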


2.2 Singular-value decomposition, condition number 
and condition indexes 


The singular-value decomposition (SVD) of matrix $X$ is
very closely allied to the eigensystem of $X'X$, but with its
own advantages. The $n \times p$ matrix $X$ can be decomposed
as $X = UDV'$, where $U'U = V'V = I_p$ and $D =
\mathrm{diag}(\mu_1, \ldots, \mu_p)$ is the diagonal matrix of singular values
(or eigenvalues) of $X$. Here, the three components in the
decomposition are matrices with very special, highly
exploitable properties: $U$ is $n \times p$ (the same size as $X$)
and is column orthogonal; $V$ is $p \times p$ and both row and
column orthogonal; $D$ is $p \times p$, nonnegative and diagonal.
Belsley et al. (1980) felt that the SVD of $X$ has several
advantages over the eigensystem of $X'X$, for the sake of
both statistical usages and computational complexity. For
prediction, $X$ is the focus, not the cross-product matrix
$X'X$, since $\hat Y = X\hat\beta$. In addition, the lengths $\|a\|$ of the
linear combinations (1) of $X$ that relate to collinearity are
properly defined in terms of the square roots of the
eigenvalues of $X'X$, which are the singular values of $X$.
A secondary consideration, given current computing power,
is that the singular value decomposition of $X$ avoids the
additional computational burden of forming $X'X$, an
operation involving $np^2$ unneeded sums and products,
which may lead to unnecessary truncation error.

The condition number of $X$ is defined as $\kappa(X) =
\mu_{max}/\mu_{min}$, where $\mu_{max}$ and $\mu_{min}$ are the maximum and
minimum singular values of $X$. Condition indexes are
defined as $\eta_k = \mu_{max}/\mu_k$. The closer $\mu_{min}$ is to zero,
the nearer $X'X$ is to being singular. Empirically, if a value
of $\kappa$ or $\eta$ exceeds a cutoff value of, say, 10 to 30, two or
more columns of $X$ have moderate or strong relations. The
simultaneous occurrence of several large $\eta_k$'s signals
the existence of more than one near
dependency.
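
Condition indexes follow directly from the singular values. The snippet below is a minimal sketch using NumPy's SVD; the helper name `condition_indexes` and the simulated matrix (with one built-in near dependency) are assumptions for illustration.

```python
import numpy as np


def condition_indexes(X):
    """Return eta_k = mu_max / mu_k for the singular values mu_k of X."""
    mu = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)  # descending order
    return mu[0] / mu


rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)   # near dependency built in
X = np.column_stack([x1, x2, x3])

eta = condition_indexes(X)
kappa = eta.max()    # the condition number is the largest condition index
```

In this example only one condition index is large, which matches the single near dependency that was built into the data.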

One issue with the SVD is whether the X’s should be 
centered around their means. Marquardt (1980) maintained 
that the centering of observations removes nonessential ill 
conditioning. In contrast, Belsley (1984) argues that mean- 
centering typically masks the role of the constant term in 
any underlying near-dependencies. A typical case is a 
regression with dummy variables. For example, if gender is 
one of the independent variables in a regression and most of 
the cases are male (or female), then the dummy for gender 
can be strongly collinear with the intercept. The discussions 
following Belsley (1984) illustrate the differences of opin- 
ion that occur among practitioners (Wood 1984; Snee and 
Marquardt 1984; Cook 1984). Moreover, in linear regres- 
sion analysis, Wissmann, Toutenburg and Shalabh (2007) 
found that the degree of multicollinearity with dummy 
variables may be influenced by the choice of reference 
category. In this article, we do not center the X’s but will 
illustrate the effect of the choice of reference category in 
section 4. 

Another problem with the condition number is that it is
affected by the scale of the $x$ measurements (Stewart
1987). By scaling down any column of $X$, the condition
number can be made arbitrarily large. This situation is
known as artificial ill-conditioning. Belsley (1991) suggests
scaling each column of the design matrix $X$ using the
Euclidean norm of each column before computing the
condition number. This method is implemented in SAS and
the package perturb of the statistical software R (Hendrickx
2010). Both use the root mean square of each column for
scaling as their standard procedure. The condition number and
condition indexes of the scaled matrix $X$ are referred to as
the scaled condition number and scaled condition indexes of
the matrix $X$. Similarly, the variance decomposition
proportions relevant to the scaled $X$ (which will be discussed
in the next section) will be called the scaled variance
decomposition proportions.
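
Artificial ill-conditioning and its removal by column scaling can be demonstrated with a small example. The sketch below scales columns by their Euclidean norm; scaling by the root mean square instead differs only by the common factor $\sqrt{n}$ and so leaves the condition indexes unchanged. The helper `scale_columns` and the test matrix are assumptions for illustration, not the SAS or perturb implementation.

```python
import numpy as np


def scale_columns(X):
    """Scale each column of X to unit Euclidean norm."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=0)


rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(60, 3)))   # three orthonormal columns
X = Q.copy()
X[:, 0] *= 1e-6                                 # shrink one column only

kappa_raw = np.linalg.cond(X)                   # inflated by the rescaling
kappa_scaled = np.linalg.cond(scale_columns(X)) # back to the orthogonal case
```

The columns are exactly orthogonal here, so the large unscaled condition number reflects units of measurement only, not any dependency among the predictors.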


2.3 Variance decomposition method 


To assess the extent to which near dependencies (i.e.,
having high condition indexes of $X$ and $X'X$) degrade
the estimated variance of each regression coefficient,
Belsley et al. (1980) reinterpreted and extended the work
of Silvey (1969) by decomposing a coefficient variance
into a sum of terms each of which is associated with a
singular value. In the remainder of this section, we
review the results of ordinary least squares (OLS) under
the model $E_M(Y) = X\beta$ and $Var_M(Y) = \sigma^2 I_n$, where
$I_n$ is the $n \times n$ identity matrix. These results will be
extended to survey weighted least squares in section 3.
Recall that the model variance-covariance matrix of the
OLS estimator $\hat\beta = (X'X)^{-1}X'Y$ is $Var_M(\hat\beta) = \sigma^2 (X'X)^{-1}$.
Using the SVD, $X = UDV'$, $Var_M(\hat\beta)$ can be written as:

$$Var_M(\hat\beta) = \sigma^2 [(UDV')'(UDV')]^{-1} = \sigma^2 V D^{-2} V' \quad (2)$$

and the $k$th diagonal element in $Var_M(\hat\beta)$ is the estimated
variance for the $k$th coefficient, $\hat\beta_k$. Using (2), $Var_M(\hat\beta_k)$
can be expressed as:

$$Var_M(\hat\beta_k) = \sigma^2 \sum_{j=1}^{p} \frac{v_{kj}^2}{\mu_j^2} \quad (3)$$

where $V = (v_{kj})_{p \times p}$. Let $\phi_{kj} = v_{kj}^2/\mu_j^2$, $\phi_k = \sum_{j=1}^{p} \phi_{kj}$ and
$Q = (\phi_{kj})_{p \times p} = (VD^{-1}) \circ (VD^{-1})$, where $\circ$ is the Hadamard
(elementwise) product. The variance-decomposition proportions
are $\pi_{jk} = \phi_{kj}/\phi_k$, which is the proportion of the
variance of the $k$th regression coefficient associated with
the $j$th component of its decomposition in (3). Denote the
variance decomposition proportion matrix as $\Pi =
(\pi_{jk})_{p \times p} = Q'\bar{Q}^{-1}$, where $\bar{Q}$ is the diagonal matrix with
the row sums of $Q$ on the main diagonal and 0 elsewhere.
If the model is $E_M(Y) = X\beta$, $Var_M(Y) = \sigma^2 W^{-1}$ and
weighted least squares is used, then $\hat\beta_{WLS} = (X'WX)^{-1}
X'WY$ and $Var_M(\hat\beta_{WLS}) = \sigma^2 (X'WX)^{-1}$. The
decomposition in (3) holds with $\tilde{X} = W^{1/2}X$ being decomposed
as $\tilde{X} = UDV'$. However, in survey applications, it will
virtually never be the case that the covariance matrix of $Y$
is $\sigma^2 W^{-1}$ if $W$ is the matrix of survey weights. Section 3
covers the more realistic case.
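
The variance decomposition proportions can be computed from the SVD in a few lines. The sketch below follows the definitions of $\phi_{kj}$, $\phi_k$ and $\pi_{jk}$ given above for the OLS case; the function name `variance_decomposition` and the simulated data are assumptions for illustration.

```python
import numpy as np


def variance_decomposition(X):
    """Return condition indexes, the proportion matrix Pi, and phi_k."""
    _, mu, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    V = Vt.T
    Q = (V / mu) ** 2               # phi_kj = v_kj^2 / mu_j^2 (row k, column j)
    phi = Q.sum(axis=1)             # phi_k = sum over j of phi_kj
    Pi = (Q / phi[:, None]).T       # Pi[j, k] = pi_jk = phi_kj / phi_k
    eta = mu[0] / mu                # condition indexes
    return eta, Pi, phi


rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)   # near dependency among all three
X = np.column_stack([x1, x2, x3])

eta, Pi, phi = variance_decomposition(X)
```

Each column of `Pi` sums to one, and `phi` reproduces the diagonal of $(X'X)^{-1}$, so multiplying it by $\sigma^2$ gives the coefficient variances in (3). In this example the row of `Pi` for the largest condition index carries most of the variance of all three coefficients, which is the pattern that signals a near dependency.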

In the variance decomposition (3), other things being
equal, a small singular value $\mu_j$ can lead to a large
component of $\mathrm{Var}(\hat\beta_k)$. However, if $v_{kj}$ is small too, then
$\mathrm{Var}(\hat\beta_k)$ may not be affected by a small $\mu_j$. One extreme
case is when $v_{kj} = 0$. Suppose the $k$th and $j$th columns of
$X$ belong to separate orthogonal blocks. Let $X = [X_1, X_2]$
with $X_1^T X_2 = 0$ and let the singular-value
decompositions of $X_1$ and $X_2$ be given, respectively, as
$X_1 = U_1 D_{11} V_{11}^T$ and $X_2 = U_2 D_{22} V_{22}^T$. Since $U_1$ and $U_2$
are the orthogonal bases for the space spanned by the
columns of $X_1$ and $X_2$ respectively, $X_1^T X_2 = 0$ implies
$U_1^T U_2 = 0$ and $U = [U_1, U_2]$ is column orthogonal. The
singular value decomposition of $X$ is simply $X = UDV^T$,
with:

$$D = \begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix} \tag{4}$$

and

$$V = \begin{bmatrix} V_{11} & 0 \\ 0 & V_{22} \end{bmatrix}. \tag{5}$$

Thus $V_{12} = 0$. An analogous result clearly applies to any
number of mutually orthogonal subgroups. Hence, if all the
columns in $X$ are orthogonal, all the $v_{kj} = 0$ when $k \ne j$
and $\pi_{jk} = 0$ likewise. When $v_{kj}$ is nonzero, this is a signal
that predictors $k$ and $j$ are not orthogonal.
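
The block-orthogonal case can be checked numerically. The sketch below assumes NumPy and builds two hypothetical blocks with $X_1^T X_2 = 0$ from an orthonormal basis; the mixing matrices are arbitrary illustrative choices. Note that NumPy orders singular values from largest to smallest, so $V$ appears as a column-permuted version of the block-diagonal form in (5).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Orthonormal basis with four columns, split into two orthogonal subspaces.
Qmat, _ = np.linalg.qr(rng.normal(size=(n, 4)))
X1 = Qmat[:, :2] @ np.array([[3.0, 1.0], [0.0, 0.5]])   # columns correlated within block 1
X2 = Qmat[:, 2:] @ np.array([[2.0, 1.5], [0.0, 0.2]])   # columns correlated within block 2
X = np.hstack([X1, X2])

_, mu, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Q = (V / mu) ** 2                                # Q[k, j] = v_kj^2 / mu_j^2
Pi = (Q / Q.sum(axis=1, keepdims=True)).T        # Pi[j, k]

# For each component j, the mass of its right singular vector inside each block.
mass_block1 = (V[:2, :] ** 2).sum(axis=0)
mass_block2 = (V[2:, :] ** 2).sum(axis=0)
in_block1 = mass_block1 > 0.5                    # components that belong to block 1

print(np.round(Pi, 3))
```

Each component loads on one block only, so the proportions linking a component of one block to a coefficient of the other block are zero up to rounding error.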

Since at least one $v_{kj}$ must be nonzero in (3), this implies
that a high proportion of any variance can be associated
with a large singular value even when there is no
collinearity. When diagnosing collinearity, the standard
approach is therefore to look for a high condition index
associated with a large proportion of the variance of two or
more coefficients, since there must be two or more columns of $X$
involved to make a near dependency. Belsley et al. (1980)
suggested showing the matrix $\Pi$ and condition indexes of
$X$ in a variance decomposition table as below. If two or
more elements in the $j$th row of matrix $\Pi$ are relatively
large and its associated condition index $\eta_j$ is large too, it
signals that near dependencies are influencing regression
estimates.


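Condition    Proportions of variance
Index        $\mathrm{Var}_M(\hat\beta_1)$    $\mathrm{Var}_M(\hat\beta_2)$    ...    $\mathrm{Var}_M(\hat\beta_p)$
$\eta_1$     $\pi_{11}$    $\pi_{12}$    ...    $\pi_{1p}$
$\eta_2$     $\pi_{21}$    $\pi_{22}$    ...    $\pi_{2p}$
...
$\eta_p$     $\pi_{p1}$    $\pi_{p2}$    ...    $\pi_{pp}$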
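
Reading such a table can be partly automated. The following sketch assumes NumPy and simulated data; the function name `belsley_table` is hypothetical, and the cutoffs (a condition index of at least 30, with two or more proportions above 0.3) are illustrative defaults rather than fixed rules. Columns are scaled to unit length before the SVD so that the condition indexes are the scaled ones.

```python
import numpy as np

def belsley_table(X, names, eta_cut=30.0, prop_cut=0.3):
    """Scaled condition indexes, proportion matrix, and flagged near dependencies."""
    Xs = X / np.linalg.norm(X, axis=0)           # unit-length columns
    _, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
    Q = (Vt.T / mu) ** 2
    Pi = (Q / Q.sum(axis=1, keepdims=True)).T    # Pi[j, k]
    eta = mu[0] / mu                             # singular values come sorted, largest first
    flagged = []
    for j in range(len(mu)):
        involved = [names[k] for k in range(len(names)) if Pi[j, k] > prop_cut]
        # A near dependency needs a large index and at least two coefficients involved.
        if eta[j] >= eta_cut and len(involved) >= 2:
            flagged.append((float(eta[j]), involved))
    return eta, Pi, flagged

# Simulated (hypothetical) predictors with c nearly equal to a + b.
rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.01 * rng.normal(size=n)
d = rng.normal(size=n)
X = np.column_stack([np.ones(n), a, b, c, d])
names = ["intercept", "a", "b", "c", "d"]

eta, Pi, flagged = belsley_table(X, names)
for index, involved in flagged:
    print(f"condition index {index:.0f}: {involved}")
```

Only the row with the large condition index is reported, together with the three predictors that take part in the simulated dependency.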
3. Adaptation in survey-weighted least squares


3.1 Condition indexes and variance decomposition 
proportions 


In survey-weighted least squares (SWLS), we are more
interested in the collinear relations among the columns in
the matrix $\tilde{X} = W^{1/2} X$ instead of $X$, since
$\hat\beta_{SW} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{Y}$ with $\tilde{Y} = W^{1/2} Y$.
Define the singular value decomposition of $\tilde{X}$ to
be $\tilde{X} = UDV^T$, where $U$, $V$, and $D$ are usually different
from the ones of $X$, due to the unequal survey weights.

The condition number of $\tilde{X}$ is defined as $\kappa(\tilde{X}) = \mu_{\max} / \mu_{\min}$,
where $\mu_{\max}$ and $\mu_{\min}$ are the maximum and minimum
singular values of $\tilde{X}$. The condition number of $\tilde{X}$ is
also usually different from the condition number of the data
matrix $X$ due to unequal survey weights. Condition indexes
are defined as

$$\eta_k = \mu_{\max} / \mu_k, \quad k = 1, \ldots, p \tag{6}$$

where $\mu_k$ is one of the singular values of $\tilde{X}$. The scaled
condition indexes and condition numbers are the condition
indexes and condition numbers of the scaled $\tilde{X}$.

Based on the extrema of the ratio of quadratic forms (Lin
1984), the condition number $\kappa(\tilde{X})$ is bounded in the range
of:

$$\frac{w_{\min}^{1/2}}{w_{\max}^{1/2}} \, \kappa(X) \le \kappa(\tilde{X}) \le \frac{w_{\max}^{1/2}}{w_{\min}^{1/2}} \, \kappa(X) \tag{7}$$

where $w_{\min}$ and $w_{\max}$ are the minimum and maximum
survey weights. This expression indicates that if the survey
weights do not vary too much, the condition number in
SWLS resembles the one in OLS. However, in a sample
with a wide range of survey weights, the condition number
can be very different between SWLS and OLS. When
SWLS has a large condition number, OLS might not. In the
case of exact linear dependence among the columns of $X$,
the columns of $\tilde{X}$ will also be linearly dependent. In this
extreme case at least one singular value of $\tilde{X}$ will be zero, and
both $\kappa(X)$ and $\kappa(\tilde{X})$ will be infinite. As in OLS, large
values of $\kappa$ or of the $\eta_k$'s of 10 or more may signal that
two or more columns of $\tilde{X}$ have moderate to strong
dependencies.
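
The bound in (7) is easy to check numerically. The sketch below assumes NumPy, a simulated design matrix, and hypothetical weights drawn uniformly between 1 and 55; none of these values come from the survey analyzed later.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
w = rng.uniform(1.0, 55.0, size=n)        # hypothetical survey weights
X_tilde = np.sqrt(w)[:, None] * X         # W^{1/2} X

def cond(A):
    """Condition number as the ratio of largest to smallest singular value."""
    mu = np.linalg.svd(A, compute_uv=False)
    return mu[0] / mu[-1]

kappa_X = cond(X)
kappa_Xt = cond(X_tilde)
ratio = np.sqrt(w.max() / w.min())
lower, upper = kappa_X / ratio, kappa_X * ratio

print(f"{lower:.3f} <= {kappa_Xt:.3f} <= {upper:.3f}")
```

With widely varying weights the interval is wide, which is why the OLS condition number alone says little about conditioning in SWLS.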

The model variance of the SWLS parameter estimator
under a model with $\mathrm{Var}_M(e) = \sigma^2 R$ is:

$$\mathrm{Var}_M(\hat\beta_{SW}) = \sigma^2 (X^T W X)^{-1} X^T W R W X (X^T W X)^{-1} = \sigma^2 (\tilde{X}^T \tilde{X})^{-1} G, \tag{8}$$

where

$$G = (g_{ij})_{p \times p} = X^T W R W X (X^T W X)^{-1} \tag{9}$$

is the misspecification effect (MEFF) that represents the
inflation factor needed to correct standard results for the
effect of intracluster correlation in clustered survey data and
for the fact that $\mathrm{Var}_M(e) = \sigma^2 R$ and not $\sigma^2 W^{-1}$ (Scott
and Holt 1982).

Using the SVD of $\tilde{X}$, we can rewrite $\mathrm{Var}_M(\hat\beta_{SW})$ as

$$\mathrm{Var}_M(\hat\beta_{SW}) = \sigma^2 V D^{-2} V^T G. \tag{10}$$

The $k$th diagonal element in $\mathrm{Var}_M(\hat\beta_{SW})$ is the estimated
variance for the $k$th coefficient, $\hat\beta_{SW_k}$. Using (10), $\mathrm{Var}_M(\hat\beta_{SW_k})$
can be expressed as:

$$\mathrm{Var}_M(\hat\beta_{SW_k}) = \sigma^2 \sum_{j=1}^{p} \frac{v_{kj} \lambda_{kj}}{\mu_j^2} \tag{11}$$

where $\lambda_{kj} = \sum_{i=1}^{p} v_{ij} g_{ik}$. If $R = W^{-1}$, then $G = I_p$,
$\lambda_{kj} = v_{kj}$, and (11) reduces to (3). However, the situation is
more complicated when $G$ is not the identity matrix, i.e.,
when the complex design affects the variance of an
estimated regression coefficient. If predictors $k$ and $j$ are
orthogonal, $v_{kj} = 0$ for $k \ne j$ and the variance in (11)
depends only on the $k$th singular value and is unaffected by
$g_{ij}$'s that are non-zero. If predictor $k$ and several $j$'s are
not orthogonal, then $\lambda_{kj}$ has contributions from all of those
eigenvectors and from the off-diagonal elements of the
MEFF matrix $G$. The term $\lambda_{kj}$ then measures both
non-orthogonality of $x$'s and effects of the complex design.
Consequently, we can define variance decomposition
proportions analogous to those for OLS but their
interpretation is less straightforward. Let $\phi_{kj} = v_{kj} \lambda_{kj} / \mu_j^2$,
$\phi_k = \sum_{j=1}^{p} \phi_{kj}$, and $Q = (\phi_{kj})_{p \times p} = (VD^{-2}) \cdot (V^T G)^T$. The
variance-decomposition proportions are $\pi_{jk} = \phi_{kj} / \phi_k$,
which is the proportion of the variance of the $k$th regression
coefficient associated with the $j$th component of its
decomposition in (11). Denote the variance decomposition
proportion matrix as

$$\Pi = (\pi_{jk})_{p \times p} = Q^T \bar{Q}^{-1} \tag{12}$$

where $\bar{Q}$ is the diagonal matrix with the row sums of $Q$ on
the main diagonal and 0 elsewhere. The interpretation of the
proportions in (12) is not as clear-cut as for OLS because of
the effect of the MEFF matrix. Section 3.2 discusses the
interpretation in more detail in the context of stratified
cluster sampling.
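
The construction of $Q$ and $\Pi$ in (11) and (12) can be sketched as follows, assuming NumPy, simulated data, and a known $R$ (with $\sigma^2$ factored out). The function name `swls_variance_decomposition` and the two choices of $R$ are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def swls_variance_decomposition(X, w, R):
    """Proportion matrix for SWLS under Var(e) = sigma^2 R, with sigma^2 factored out."""
    Xt = np.sqrt(w)[:, None] * X                   # X~ = W^{1/2} X
    _, mu, Vt = np.linalg.svd(Xt, full_matrices=False)
    V = Vt.T
    A_inv = np.linalg.inv(Xt.T @ Xt)               # (X^T W X)^{-1}
    WX = w[:, None] * X
    G = WX.T @ R @ WX @ A_inv                      # misspecification effect, as in (9)
    # Q[k, j] = v_kj * lambda_kj / mu_j^2, where lambda_kj = (V^T G)[j, k]
    Q = (V / mu**2) * (Vt @ G).T
    phi = Q.sum(axis=1)                            # diagonal of V D^{-2} V^T G
    Pi = (Q / phi[:, None]).T                      # Pi = Q^T Qbar^{-1}
    return Pi, phi, G, A_inv

rng = np.random.default_rng(4)
n = 120
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
w = rng.uniform(1.0, 20.0, size=n)                 # hypothetical unequal weights

# Case 1: R = I with unequal weights, so G differs from the identity.
Pi_I, phi_I, G_I, A_inv = swls_variance_decomposition(X, w, np.eye(n))
# Case 2: R = W^{-1}, the WLS assumption, so G = I and (11) reduces to (3).
Pi_W, phi_W, G_W, _ = swls_variance_decomposition(X, w, np.diag(1.0 / w))

print(np.round(Pi_I, 3))
```

Because individual terms $v_{kj}\lambda_{kj}$ need not be positive when $G \ne I$, entries of `Pi_I` are not guaranteed to lie between 0 and 1, although each column still sums to 1.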

Analogous to the method for OLS regression, a variance
decomposition table can be formed like the one at the end of
section 2. When two or more independent variables are
collinear (or “nearly dependent”), one singular value should
make a large contribution to the variance of the parameter
estimates associated with those variables. For example, if
the proportions $\pi_{31}$ and $\pi_{32}$ for the variances of $\hat\beta_{SW_1}$ and
$\hat\beta_{SW_2}$ are large, this would say that the third singular value
makes a large contribution to both variances and that the
first and second predictors in the regression are, to some
extent, collinear. As shown in section 2.3, when the $k$th and
$j$th columns in $\tilde{X}$ are orthogonal, $v_{kj} = 0$ and the $j$th
singular value's decomposition proportion $\pi_{jk}$ on $\mathrm{Var}(\hat\beta_{SW_k})$
will be 0.

Several special cases are worth noting. If $R = W^{-1}$ as
assumed in WLS, then $G = I$. The variance decomposition
in (11) has the same form as (2) in OLS. However, having
$R = W^{-1}$ in survey data would be unusual since survey
weights are not typically computed based on the variance
structure of a model. Note that $V$ is still different from the
one in OLS and is one component of the SVD of $\tilde{X}$ instead
of $X$. Another special case here is when $R = I$ and the
survey weights are equal, in which case the OLS results can
be used. However, when the survey weights are unequal,
even when $R = I$, the variance decomposition in (11) is
different from (2) in OLS since $G \ne I$. In the next section,
we will consider some special models that take the
population features such as clusters and strata into account when
estimating this variance decomposition.


3.2 Variance decomposition for a model with 
stratified clustering 


The model variance of $\hat\beta_{SW}$ in (8) contains the unknown
$R$ that must be estimated. In this section, we present a
variance estimator for $\hat\beta_{SW}$ that is appropriate for a model with
stratified clustering. The variance estimator has both
model-based and design-based justification. Suppose that in a
stratified multistage sampling design, there are strata
$h = 1, \ldots, H$ in the population, clusters $i = 1, \ldots, N_h$ in
stratum $h$ and units $t = 1, \ldots, M_{hi}$ in cluster $hi$. We select
clusters $i = 1, \ldots, n_h$ in stratum $h$ and units $t = 1, \ldots, m_{hi}$
in cluster $hi$. Denote the set of sample clusters in stratum $h$
by $s_h$ and the sample of units in cluster $hi$ as $s_{hi}$. The total
number of sample units in stratum $h$ is $m_h = \sum_{i \in s_h} m_{hi}$ and
the total in the sample is $m = \sum_{h=1}^{H} m_h$. Assume that clusters
are selected with varying probabilities and with replacement
within strata and independently between strata. The model
we consider is:

$$\begin{aligned}
E_M(Y_{hit}) &= x_{hit}^T \beta, \quad h = 1, \ldots, H, \; i = 1, \ldots, N_h, \; t = 1, \ldots, M_{hi} \\
\mathrm{Cov}_M(\varepsilon_{hit}, \varepsilon_{hi't'}) &= 0, \quad i \ne i', \text{ where } \varepsilon_{hit} = Y_{hit} - x_{hit}^T \beta \\
\mathrm{Cov}_M(\varepsilon_{hit}, \varepsilon_{h'i't'}) &= 0, \quad h \ne h'.
\end{aligned} \tag{13}$$


Units within each cluster are assumed to be correlated but
the particular form of the covariances does not have to be
specified for this analysis. The estimator $\hat\beta_{SW}$ of the
regression parameter can be written as:

$$\hat\beta_{SW} = \sum_{h=1}^{H} \sum_{i \in s_h} (\tilde{X}^T \tilde{X})^{-1} X_{hi}^T W_{hi} Y_{hi}, \tag{14}$$

where $X_{hi}$ is the $m_{hi} \times p$ matrix of covariates for sample
units in cluster $hi$, $W_{hi} = \mathrm{diag}(w_t)$, $t \in s_{hi}$, is the diagonal
matrix of survey weights for units in cluster $hi$ and $Y_{hi}$ is
the $m_{hi} \times 1$ vector of response variables in cluster $hi$. The
model variance of $\hat\beta_{SW}$ is:

$$\mathrm{Var}_M(\hat\beta_{SW}) = (\tilde{X}^T \tilde{X})^{-1} G_{st} \tag{15}$$

where

$$G_{st} = \left[ \sum_{h=1}^{H} \sum_{i \in s_h} X_{hi}^T W_{hi} R_{hi} W_{hi} X_{hi} \right] (\tilde{X}^T \tilde{X})^{-1} = \left[ \sum_{h=1}^{H} X_h^T W_h R_h W_h X_h \right] (\tilde{X}^T \tilde{X})^{-1} \tag{16}$$

with $R_{hi} = \mathrm{Var}_M(Y_{hi})$, $W_{hi} = \mathrm{diag}(w_t)$, $R_h = \mathrm{Blkdiag}(R_{hi})$,
$W_h = \mathrm{diag}(W_{hi})$, and $X_h^T = (X_{h1}^T, X_{h2}^T, \ldots, X_{h n_h}^T)$,
$i \in s_h$. Expression (16) is a special case of (9) with
$X^T = (X_1^T, X_2^T, \ldots, X_H^T)$, where $X_h$ is the $m_h \times p$ matrix
of covariates for sample units in stratum $h$, $W = \mathrm{diag}(W_{hi})$,
for $h = 1, \ldots, H$ and $i \in s_h$, and $R = \mathrm{Blkdiag}(R_h)$.

Based on the development in Scott and Holt (1982, section
4), the MEFF matrix $G_{st}$ can be rewritten for a special case
of $R_h$ in a way that will make the decomposition
proportions in (12) more understandable. Consider the
special case of (13) with

$$\mathrm{Cov}_M(e_{hi}) = \sigma^2 (1 - \rho) I_{m_{hi}} + \sigma^2 \rho \, 1_{m_{hi}} 1_{m_{hi}}^T$$

where $I_{m_{hi}}$ is the $m_{hi} \times m_{hi}$ identity matrix and $1_{m_{hi}}$ is a
vector of $m_{hi}$ 1's. In that case, with $\sigma^2$ factored out as in (8),

$$X_h^T W_h R_h W_h X_h = (1 - \rho) X_h^T W_h^2 X_h + \rho \sum_{i \in s_h} m_{hi} X_{Bhi}^T W_{hi}^2 X_{Bhi}$$

where $X_{Bhi} = m_{hi}^{-1} 1_{m_{hi}} 1_{m_{hi}}^T X_{hi}$. Suppose that the sample is
self-weighting so that $W_{hi} = w I_{m_{hi}}$. After some
simplification, it follows that

$$G_{st} = w [I_p + (M - I_p) \rho]$$

where $I_p$ is the $p \times p$ identity matrix and
$M = \left( \sum_{h=1}^{H} \sum_{i \in s_h} m_{hi} X_{Bhi}^T X_{Bhi} \right) (X^T X)^{-1}$. Thus, if the sample is
self-weighting and $\rho$ is very small, then $G_{st} \approx w I_p$ and
$\mathrm{Var}_M(\hat\beta_{SW})$ in (15) will be approximately the same as the
OLS variance. If so, the SWLS variance decomposition
proportions will be similar to the OLS proportions. In
regression problems, $\rho$ often is small since it is the correlation of
the errors, $\varepsilon_{hit} = Y_{hit} - x_{hit}^T \beta$, for different units rather than
for the $Y_{hit}$'s. This is related to the phenomenon that design
effects for regression coefficients are often smaller than for
means, a fact first noted by Kish and Frankel (1974). In
applications where $\rho$ is larger, the variance decomposition
proportions in (12) will still be useful in identifying
collinearity although they will be affected by departures of the
model errors from independence.
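
The self-weighting result $G_{st} = w[I_p + (M - I_p)\rho]$ can be verified numerically. The sketch below assumes NumPy, a single stratum, four hypothetical cluster sizes, and an exchangeable correlation matrix with $\sigma^2$ factored out; the values of `w` and `rho` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
p, w, rho = 3, 2.5, 0.1
cluster_sizes = [4, 6, 5, 7]                  # hypothetical m_hi for four sample clusters

X_blocks = [np.column_stack([np.ones(m), rng.normal(size=(m, p - 1))])
            for m in cluster_sizes]
X = np.vstack(X_blocks)
n = X.shape[0]

# Block-diagonal exchangeable correlation: R_hi = (1 - rho) I + rho 1 1^T
R = np.zeros((n, n))
start = 0
for m in cluster_sizes:
    R[start:start + m, start:start + m] = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
    start += m

W = w * np.eye(n)                             # self-weighting sample
G_direct = X.T @ W @ R @ W @ X @ np.linalg.inv(X.T @ W @ X)

# M = (sum_hi m_hi X_B^T X_B)(X^T X)^{-1}; rows of X_B equal the cluster mean of X_hi
S = np.zeros((p, p))
for Xb in X_blocks:
    m = Xb.shape[0]
    XB = np.ones((m, m)) @ Xb / m
    S += m * XB.T @ XB
M = S @ np.linalg.inv(X.T @ X)
G_closed = w * (np.eye(p) + (M - np.eye(p)) * rho)

print(np.round(G_direct - G_closed, 12))
```

Setting `rho` close to zero in this sketch drives both matrices toward $w I_p$, matching the remark that SWLS and OLS decompositions then nearly coincide.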

Denote the cluster-level residuals as a vector, $e_{hi} = Y_{hi} - X_{hi} \hat\beta_{SW}$.
The estimator of (15) that we consider was originally
derived from design-based considerations. A linearization
estimator, appropriate when clusters are selected with
replacement, is:

$$\mathrm{var}_L(\hat\beta_{SW}) = (\tilde{X}^T \tilde{X})^{-1} \hat{G}_L \tag{17}$$

with the estimated misspecification effect as

$$\hat{G}_L = (\hat{g}_{ij})_{p \times p} = \left[ \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{i \in s_h} (z_{hi} - \bar{z}_h)(z_{hi} - \bar{z}_h)^T \right] (\tilde{X}^T \tilde{X})^{-1} \tag{18}$$

where $\bar{z}_h = (1/n_h) \sum_{i \in s_h} z_{hi}$ and $z_{hi} = X_{hi}^T W_{hi} e_{hi}$ with
$e_{hi} = Y_{hi} - X_{hi} \hat\beta_{SW}$, and the variance-covariance matrix $R$ can
be estimated by

$$\hat{R} = \frac{n_h}{n_h - 1} \left[ \mathrm{Blkdiag}(e_{hi} e_{hi}^T) - \frac{1}{n_h} e_h e_h^T \right].$$

Expression (17) is used by the Stata and SUDAAN
packages, among others. The estimator $\mathrm{var}_L(\hat\beta_{SW})$ is consistent
and approximately design-unbiased under a design where
clusters are selected with replacement (Fuller 2002). The
estimator in (17) is also an approximately model-unbiased
estimator of (15) (see Liao 2010). Since the estimator
$\mathrm{var}_L(\hat\beta_{SW})$ is also currently available in software packages,
we will use it in the empirical work in section 4.
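
A compact sketch of the linearization estimator in (17) and (18) is given below, assuming NumPy and a small simulated design with 4 strata, 2 clusters per stratum, and 10 units per cluster. The function name `linearization_variance` and all data are hypothetical; this is not the code used by Stata or SUDAAN.

```python
import numpy as np

def linearization_variance(X, y, w, strata, clusters):
    """With-replacement linearization variance of the SWLS estimator, following (17)-(18)."""
    WX = w[:, None] * X
    A_inv = np.linalg.inv(X.T @ WX)               # (X^T W X)^{-1}
    beta = A_inv @ (WX.T @ y)
    resid = y - X @ beta
    scores = WX * resid[:, None]                  # unit-level rows x_t w_t e_t
    p = X.shape[1]
    middle = np.zeros((p, p))
    for h in np.unique(strata):
        in_h = strata == h
        ids = np.unique(clusters[in_h])
        # z_hi = X_hi^T W_hi e_hi: score totals by cluster within stratum h
        z = np.array([scores[in_h & (clusters == i)].sum(axis=0) for i in ids])
        n_h = len(ids)
        dev = z - z.mean(axis=0)
        middle += n_h / (n_h - 1) * dev.T @ dev
    G_hat = middle @ A_inv                        # estimated misspecification effect (18)
    return beta, A_inv @ G_hat, G_hat

# Hypothetical design: 4 strata, 2 PSUs per stratum, 10 units per PSU.
rng = np.random.default_rng(6)
H, n_h, m = 4, 2, 10
strata = np.repeat(np.arange(H), n_h * m)
clusters = np.tile(np.repeat(np.arange(n_h), m), H)
n = H * n_h * m
X = np.column_stack([np.ones(n), rng.normal(size=n)])
cluster_effect = np.repeat(rng.normal(scale=0.5, size=H * n_h), m)
y = X @ np.array([1.0, 2.0]) + cluster_effect + rng.normal(size=n)
w = rng.uniform(1.0, 10.0, size=n)

beta, V_L, G_hat = linearization_variance(X, y, w, strata, clusters)
print(np.round(beta, 3), np.round(np.sqrt(np.diag(V_L)), 3))
```

The returned `G_hat` can be passed to the decomposition $(VD^{-2}) \cdot (V^T \hat{G}_L)^T$ to obtain estimated proportions.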

Using (12) derived in section 3.1, the variance
decomposition proportion matrix $\hat\Pi$ for $\mathrm{var}_L(\hat\beta_{SW})$ can then be
written as

$$\hat\Pi = (\hat\pi_{jk})_{p \times p} = \hat{Q}_L^T \bar{\hat{Q}}_L^{-1} \tag{19}$$

with $\hat{Q}_L = (\hat\phi_{kj})_{p \times p} = (VD^{-2}) \cdot (V^T \hat{G}_L)^T$, where $\bar{\hat{Q}}_L$ is the
diagonal matrix with the row sums of $\hat{Q}_L$ on the main
diagonal and 0 elsewhere.


4. Numerical illustrations


In this section, we will illustrate the collinearity measures
described in section 3 and investigate their behaviors using
the dietary intake data from the 2007-2008 National Health and
Nutrition Examination Survey (NHANES).


4.1 Description of the data 


The dietary intake data are used to estimate the types and 
amounts of foods and beverages consumed during the 24- 
hour period prior to the interview (midnight to midnight), 
and to estimate intakes of energy, nutrients, and other food 
components from those foods and beverages. NHANES 
uses a complex, multistage, probability sampling design; 
oversampling of certain population subgroups is done to 
increase the reliability and precision of health status indi- 
cator estimates for these groups. Among the respondents 
who received the in-person interview in the mobile exami- 
nation center (MEC), around 94% provided complete di- 
etary intakes. The survey weights were constructed by 
taking MEC sample weights and further adjusting for the 
additional nonresponse and the differential allocation by day 
of the week for the dietary intake data collection. These 
weights are more variable than the MEC weights. The data 
set used in our study is a subset of 2007-2008 data com- 
posed of female respondents aged 26 to 40. Observations 
with missing values in the selected variables are excluded 
from the sample which finally contains 672 complete re- 
spondents. The final weights in our sample range from 
6,028 to 330,067, with a ratio of 55:1. The U.S. National
Center for Health Statistics recommends that the design of
the sample be approximated by the stratified selection with
replacement of 32 PSUs from 16 strata, with 2 PSUs within
each stratum.


4.2 Study one: Correlated covariates 


In the first empirical study, a linear regression model of 
respondent’s body mass index (BMI) was considered. The 
explanatory variables considered included two demographic 
variables, respondent’s age and race (Black/Non-black), 
four dummy variables for whether the respondent is on a 
special diet of any kind, on a low-calorie diet, on a low-fat 
diet, and on a low-carbohydrate diet (each equal to 1 when the
respondent is on the diet and 0 otherwise), and ten daily total nutrition
intake variables, consisting of total calories (100kcal), pro- 
tein (100gm), carbohydrate (100gm), sugar (100gm), dietary 
fiber (100gm), alcohol (100gm), total fat (100gm), total 
saturated fatty acids (100gm), total monounsaturated fatty 
acids (100gm), and total polyunsaturated fatty acids 
(100gm). The correlation coefficients among these variables 
are displayed in Table 2. Note that the correlations among
the daily total nutrition intake variables are often high. For
example, the correlations of the total fat intakes with total
saturated fatty acids, total monounsaturated fatty acids and
total polyunsaturated fatty acids are 0.85, 0.97 and 0.93, respectively.

Three types of regressions were fitted for the selected 
sample to demonstrate different diagnostics. More details 
about these three regression types and their diagnostic statis- 
tics are displayed in Table 1. 


TYPE1: OLS regression with estimated $\sigma^2$; the diagnostic
statistics are obtained using the standard methods reviewed
in section 2;


TYPE2: WLS regression with estimated $\sigma^2$ and assuming
$R = W^{-1}$; the scaled condition indexes are estimated using
(6) and the scaled variance decomposition proportions are
estimated using (12). With $R = W^{-1}$, these are the
variance decompositions that will be produced by standard
software using WLS and specifying the weights to be the
survey weights;


TYPE3: SWLS with estimated $\hat{R}$; the scaled condition
indexes are estimated using (6); the scaled variance
decomposition proportions are estimated using (12).


Their diagnostic statistics, including the scaled condition
indexes and variance decomposition proportions, are
reported in Tables 3, 4 and 5, respectively. To make the
tables more readable, only the proportions that are larger
than 0.3 are shown. Proportions that are less than 0.3 are
shown as dots. Note that some terms in decomposition (12)
can be negative. This leads to the possibility of some
“proportions” being greater than 1. This occurs in five cases
in Table 5. Belsley et al. (1980) suggest that a condition
index of 10 signals that collinearity has a moderate effect on
standard errors; an index of 100 would indicate a serious
effect. In this study, we consider a scaled condition index
greater than 10 to be relatively large, and ones greater than
30 as large and remarkable. Furthermore, the large scaled
variance-decomposition proportions (greater than 0.3)
associated with each large scaled condition index will be
used to identify those variates that are involved in a near
dependency. The intracluster correlation of the residuals is
shown in the last row of Table 6 under the column labeled
“Original Model”. In the model used for Tables 3-5, $\rho = 0.0366$
as estimated from a model with random effects for
clusters. As noted in section 3.2, when $\rho$ is small and the
sample is self-weighting, the SWLS decomposition
proportions can be interpreted in the same way as those of OLS.
Although the NHANES sample does not have equal
weights, $\rho$ is small in this example and the decomposition
proportions should still provide useful information.


Table 1
Regression models and their collinearity diagnostic statistics used in this experimental study

Type | Regression method | Weight matrix $W$ (a) | $\mathrm{var}(\hat\beta)$ | Matrix for condition indexes (b) | Variance decomposition proportion $\pi_{jk}$
TYPE1 | OLS | $I$ | $\hat\sigma^2 (X^T X)^{-1}$ | $X^T X$ | $\dfrac{v_{kj}^2 / \mu_j^2}{\sum_{j=1}^{p} v_{kj}^2 / \mu_j^2}$ (c)
TYPE2 | WLS | $W$ | $\hat\sigma^2 (X^T W X)^{-1}$ | $X^T W X$ | $\dfrac{v_{kj}^2 / \mu_j^2}{\sum_{j=1}^{p} v_{kj}^2 / \mu_j^2}$ (d)
TYPE3 | SWLS | $W$ | $\hat\sigma^2 (X^T W X)^{-1} X^T W \hat{R} W X (X^T W X)^{-1}$ | $X^T W X$ | $\dfrac{v_{kj} \sum_{i=1}^{p} v_{ij} g_{ik} / \mu_j^2}{\sum_{j=1}^{p} \left( v_{kj} \sum_{i=1}^{p} v_{ij} g_{ik} / \mu_j^2 \right)}$ (e)

For TYPE3, $\hat{R} = \frac{n_h}{n_h - 1} \left[ \mathrm{Blkdiag}(e_{hi} e_{hi}^T) - \frac{1}{n_h} e_h e_h^T \right]$.

(a) In all the regression models, the parameters are estimated by: $\hat\beta = (X^T W X)^{-1} X^T W Y$.
(b) The eigenvalues of this matrix will be used to compute the Condition Indexes for the corresponding regression model.
(c) The terms $v_{kj}$ and $\mu_j$ are from the singular value decomposition of the data matrix $X$.
(d) The terms $v_{kj}$ and $\mu_j$ are from the singular value decomposition of the weighted data matrix $\tilde{X} = W^{1/2} X$.
(e) The terms $v_{kj}$ and $\mu_j$ are from the singular value decomposition (SVD) of the weighted data matrix $\tilde{X}$. The term $g_{ik}$ is the unit element of misspecification effect matrix $\hat{G}$.
In Tables 3, 4 and 5, the weighted regression methods,
WLS and SWLS, used the survey-weighted data matrix $\tilde{X}$
to obtain the condition indexes while the unweighted
regression method, OLS, used the data matrix $X$. The
largest scaled condition index in WLS and SWLS is 566,
which is slightly smaller than the one in OLS, 581. Both of
these values are much larger than 30 and, thus, signal a
severe near-dependency among the predictors in all three
regression models. Such large condition numbers imply that
the inverse of the matrix $X^T W X$ may be
numerically unstable, i.e., small changes in the $x$ data could
make large changes in the elements of the inverse.

The values of the decomposition proportions for OLS
and WLS are very similar and lead to the same predictors
being identified as potentially collinear. Results for SWLS
are somewhat different as sketched below. In OLS and
WLS, six daily total nutrition intake variables (calorie,
protein, carbohydrate, alcohol, dietary fiber and total fat) are
involved in the dominant near-dependency that is associated
with the largest scaled condition index. Four daily fat intake
variables, total fat, total saturated fatty acids, total
monounsaturated fatty acids and total polyunsaturated fatty acids,
are involved in the secondary near-dependency that is
associated with the second largest scaled condition index. A
moderate near-dependency between intercept and age is also
shown in all three tables. The associated scaled condition
index is equal to 38 in OLS and 37 in WLS and SWLS.
However, when SWLS is used, sugar, total saturated fatty
acids and total polyunsaturated fatty acids also appear to be
involved in the dominant near-dependency as shown in
Table 5, while only three daily fat intake variables, total
saturated fatty acids, total monounsaturated fatty acids and
total polyunsaturated fatty acids, are involved in the
secondary near-dependency that is associated with the
second largest scaled condition index. Thus, when OLS or
WLS is used, the impact of near-dependency among sugar,
total saturated fatty acids, total polyunsaturated fatty acids
and the six daily total nutrition intake variables is not as
strong as in SWLS. If conventional OLS or WLS
diagnostics are used for SWLS, this near-dependency might
be overlooked.

Rather than using the scaled condition indexes and
variance decomposition method (in Tables 3, 4 and 5), an
analyst might attempt to identify collinearities by examining
the unweighted correlation coefficient matrix in Table 2.
Although the correlation coefficient matrix shows that
almost all the daily total nutrition intake variables are highly
or moderately pairwise correlated, it cannot be used to
reliably identify the near-dependencies among these
variables when used in a regression. For example, the
correlation coefficient between “on any diet” and “on
low-calorie diet” is relatively large (0.73). This near dependency
is associated with a scaled condition index equal to 11
(larger than 10, but less than the cutoff of 30) in OLS and
WLS (shown in Tables 3 and 4) and is associated with a
scaled condition index equal to 2 (less than 10) in SWLS
(shown in Table 5). The impact of this near dependency
appears not to be very harmful no matter which regression
method is used. On the other hand, alcohol is weakly
correlated with all the daily total nutrition intake variables
but is highly involved in the dominant near-dependency
shown in the last row of Tables 3-5.

After the collinearity patterns are diagnosed, the common 
corrective action would be to drop the correlated variables, 
refit the model and reexamine standard errors, collinearity 
measures and other diagnostics. Omitting X’s one at a time 
may be advisable because of the potentially complex 
interplay of explanatory variables. In this example, if the 
total fat intake is one of the key variables that an analyst 
feels must be kept, sugar might be dropped first followed by 
protein, calorie, alcohol, carbohydrate, total fat, dietary 
fiber, total monounsaturated fatty acids, total polyun- 
saturated fatty acids and total saturated fatty acids. Other 
remedies for collinearity could be to transform the data or 
use some specialized techniques such as ridge regression 
and mixed Bayesian modeling, which require extra (prior) 
information beyond the scope of most research and 
evaluations. 

To demonstrate how the collinearity diagnostics can 
improve the regression results in this example, Table 6 
presents the SWLS regression analysis output of the original
model with all the explanatory variables and a reduced
model with fewer explanatory variables. In the reduced
model, all of the dietary intake variables are eliminated
except total fat intake. After the number of correlated
offending variables is reduced, the standard error of total fat
intake is only one forty-sixth of its standard error in the
original model. The total fat intake becomes significant in
the reduced model. The reduction of correlated variables
appears to have substantially improved the accuracy of
estimating the impact of total fat intake on BMI. Note that
the collinearity diagnostics do not provide a unique path
toward a final model. Different analysts may make different
choices about whether particular predictors should be
dropped or retained.

Table 2
Correlation coefficient matrix of the data matrix X
Columns, in order: age; black; on any diet; on low-calorie diet; on low-fat diet; on low-carb diet; calorie; protein; carbohydrate; sugar; fiber; alcohol; total.fat; sat.fat; mono.fat; poly.fat


age 1 

black 2 I 

on any diet ‘ F 1 

on low-calorie diet ‘ : 0.87 © | 

on low-fat diet : ; : : 1 

one low-carb diet ; ; : F ; 1 

calorie : : ‘ : d 1 

protein : : : 2 : : 0.75 1 

carb ; : : ‘ : : 0.84 0.45 1 

sugar : ; : , : : 0.58 : 0.84 1 

fiber : > : ‘ ; ; OST; 0.52 0.54 ; 1 

alcohol , : : : ; : ; : ; 5 2 1 

total.fat : : 5 ; : ; 0.86 0.72 0.54 : 0.48 : 1 

sat.fat “ : : ? : : : 0.74 0.56 0.47 ; 0.46 . 0.85 ] 

mono.fat © : : ; ? ; : 0.83 0.68 0.51 ; 0.46 ; 0.97 0.82 1 
poly.fat f : : ; : ; : 0.81 0.71 0.51 : 0.43 E 0.93 0.63 0.87 l 
The term “carb” stands for carbohydrate.
Correlation coefficients less than 0.3 are omitted in this table.
Correlation coefficients larger than 0.3 are italicized in this table.
sat.fat: Total Saturated Fatty Acids.
mono.fat: Total Monounsaturated Fatty Acids.
poly.fat: Total Polyunsaturated Fatty Acids.

Table 3 
Scaled condition indexes and variance decomposition proportions: Using TYPE1: OLS 


Scaled Proportion of the Variance of 


Scaled Intercept Age Black on any Diet on Low- on Low-fat on Low-carb Calorie Protein 
Condition Index Calorie Diet Diet Diet 


1 2 


0.574 


0.379 
0.794 


0.842 0.820 


DIN WS a Crcocy Gh SG wit 


38 0.970 0.960 
157 ‘ ‘ : ; : : , : 
581 ; : : : : ; : 0.993 0.966 
Scaled Carbohydrate Sugar Dietary Fiber Alcohol Total Fat Sat.fat ’ Mono.fat © Poly.fat “ 
Condition Index 


1 


ES) Eh SIC ON ES CHES CIS 


0.633 
38 : : : ‘ ; : : 3 
157 F , : : 0.304 0.866 0.890 0.904 
581 0.988 : 0.482 0.986 0.696 ; : 


a The scaled variance decomposition proportions smaller than 0.3 are omitted in this table.
b Total Saturated Fatty Acids.
c Total Monounsaturated Fatty Acids.
d Total Polyunsaturated Fatty Acids.

Table 4
Scaled condition indexes and variance decomposition proportions: Using TYPE2: WLS


Scaled Proportion of the Variance of 


Scaled Intercept Age Black on any Diet on Low- on Low-fat on Low-carb Calorie Protein 
Condition Index Calorie Diet Diet Diet 


0.609 


3 

3 ‘ 
3 : : . 0.347 
+ ; : 0.711 : 

5 : 

7 


i . 0.902 0.878 


37 0.959 0.940 
165 . : : : : : : : : 
566 ; , ‘ : ‘ : 4 0.992 0.963 
Scaled Carbohydrate Sugar _ Dietary Fiber Alcohol Total Fat Sat.fat ’ Mono.fat ° Poly.fat “ 
Condition Index 


Conn RB WWWN 


26 ; 0.630 


165 0.342 0.871 0.909 0.919 
566 0.987 0.486 0.981 0.658 


The scaled variance decomposition proportions smaller than 0.3 are omitted in this table. 
Total Saturated Fatty Acids. 

Total Monounsaturated Fatty Acids. 

Total Polyunsaturated Fatty Acids. 


Table 5 
Scaled condition indexes and variance decomposition proportions: Using TYPE3: SWLS 


Scaled Proportion of the Variance of 


Scaled Intercept Age Black on any Diet on Low- on Low-fat on Low-carb Calorie Protein 
Condition Index Calorie Diet Diet Diet 


0.717 1.278 0.553 . 
0.697 


0.766 1.686 0,461 


566 0.318 1.095 1.190 


a The scaled variance decomposition proportions smaller than 0.3 are omitted in this table.
b Total Saturated Fatty Acids.
c Total Monounsaturated Fatty Acids.
d Total Polyunsaturated Fatty Acids.

Scaled condition indexes and variance decomposition proportions: Using TYPE3: SWLS


Scaled Proportion of the Variance of 


Scaled 
Condition Index Carbohydrate Sugar _ Dietary Fiber Alcohol Total Fat Sat.fat ’ Mono.fat ‘ Poly.fat Zi 
1 
2 
3 
3 
3) 
4 
5) 
7 
8 
10 
11 
13 
aM : : 
26 : 0.379 
37 : : : : 
165 ; ‘ é : , 0.651 0.749 0.615 
566 1.008 1.509 0.740 1.036 0.805 0.486 : 0.390 
a The scaled variance decomposition proportions smaller than 0.3 are omitted in this table.
b Total Saturated Fatty Acids.
c Total Monounsaturated Fatty Acids.
d Total Polyunsaturated Fatty Acids.
Table 6 
Regression analysis output using TYPE3: SWLS 
Original Model Reduced Model 
Variable Coefficient SE‘ Coefficient SE 
Intercept 24.14*#*? ery 24.20*** 2209 
Age 0.06 0.08 0.06 0.08 
Black B.A ORE 1.04 3.078 ** 0.98 
on any Diet® 1.79 152 1.28 1.80 
on Low-calorie Diet 4.09** 1.50 4.59** 1.69 
on Low-fat Diet 3.67 2.86 Sey 3.76 
on Low-carb Diet 0.46 Bel 0.87 3.86 
Calorie -0.88 2.36 
Protein 7.05 99 
Carbohydrate 3.69 9.62 
Sugar -0.31 Ledeh 
Dietary Fiber -14.52* Biteh] 
Alcohol 2.09 16.47 
Total Fat 29.34 oreo 1.47* 0.68 
Total Saturated Fatty Acids -15.90 20.18 
Total Monounsaturated Fatty Acids -22.40 23.01 
Total Polyunsaturated Fatty Acids -27.69 21.10 
Intracluster Coefficient p 0.0366 0.0396 


a Standard error.
b p-value: *, 0.05; **, 0.01; ***, 0.005.
c The reference category is “not being on diet” for all the on-diet variables here.


4.3 Study two: Reference level for categorical 
variables 

As noted earlier, using non-survey data, dummy vari- 
ables can also play an important role as a possible source for 
collinearity. The choice of reference level for a categorical 
variable may affect the degree of collinearity in the data. To 
be more specific, choosing a category that has a low 
frequency as the reference and omitting that level in order to 


fit the model may give rise to collinearity with the intercept 
term. This phenomenon carries over to survey data analysis 
as we now illustrate. 

We employed the four on-diet dummy variables used in
the previous study, which we denote in this section as “on any
diet” (DIET), “on low-calorie diet” (CALDIET), “on
low-fat diet” (FATDIET) and “on low-carbohydrate diet”
(CARBDIET). The model considered here is:

$$\begin{aligned}
\mathrm{BMI}_{hit} = {} & \beta_0 + \beta_{black} \cdot \mathrm{black}_{hit} + \beta_{TOTAL.FAT} \cdot \mathrm{TOTAL.FAT}_{hit} \\
& + \beta_{DIET} \cdot \mathrm{DIET}_{hit} + \beta_{CALDIET} \cdot \mathrm{CALDIET}_{hit} \\
& + \beta_{FATDIET} \cdot \mathrm{FATDIET}_{hit} + \beta_{CARBDIET} \cdot \mathrm{CARBDIET}_{hit} + \varepsilon_{hit}
\end{aligned} \tag{20}$$


where subscript $hit$ stands for the $t$th unit in the selected
PSU $hi$, black is the dummy variable of black (black = 1
and non-black = 0), and TOTAL.FAT is the variable of
daily total fat intake. According to the survey-weighted
frequency table, 15.04% of the respondents are “on any
diet”, 11.43% of them are “on low-calorie diet”, 1.33% of
them are “on low-fat diet” and 0.47% of them are “on
low-carbohydrate diet”. Being on a diet is, then, relatively rare in
this example. If we choose the majority level, “not being on
the diet”, as the reference category for all the four on-diet
dummy variables, we expect no severe collinearity between
dummy variables and the intercept, because most of the values
in the dummy variables will be zero. However, when fitting
model (20), assume that an analyst is interested in seeing the
impact of “not on any diet” on respondent’s BMI and
reverses the reference level of variable DIET in model (20)
into “being on the diet”. This change may cause a near
dependency in the model because the column in $X$ for
variable DIET will nearly equal the column of ones for the
intercept. The following empirical study will illustrate the
impact of this change on the regression coefficient
estimation and how we should diagnose the severity of the
resulting collinearity.

Tables 7 and 8 present the regression analysis output of
the model in (20) using the three regression types, OLS,
WLS and SWLS, listed in Table 1. Table 7 models the
effects of on-diet factors on BMI by treating “not being on
the diet” as the reference category for all the four on-diet
variables, while Table 8 changes the reference level of
variable DIET from “not on any diet” into “on any diet”
and models the effect of “not on any diet” on BMI. The
choice of reference level affects the sign of the estimated
coefficient for variable DIET but not its absolute value or
standard error. The size of the estimated intercept and its SE
are different in Tables 7 and 8, but the estimable functions,
like predictions, will, of course, be the same with either set
of reference levels. The SE of the intercept is about three
times larger when “on any diet” is the reference level for
variable DIET (Table 8) than when it is not (Table 7).

When choosing “not being on any diet” as the reference 
category for DIET in Table 9, the scaled condition indexes 
are relatively small and do not signify any remarkable near- 
dependency regardless of the type of regression. Only the 
last row for the largest condition index is printed in Tables 9 
and 10. Often, the reference category for a categorical pre- 
dictor will be chosen to be analytically meaningful. In this 
example, using “not being on any diet” would be logical. 


Regression analysis output: When “not on any diet” is the reference category for DIET variable in the model

Regression Type  Intercept    black        total.fat  on any diet  on low-calorie diet  on low-fat diet  on low-carb diet
TYPE1: OLS       [illegible]  [illegible]  0.95       3.03         [illegible]          [illegible]      -1.48
                 (0.61)       (0.70)       (0.72)     (1.94)       (2.03)               [illegible]      (3.66)
TYPE2: WLS       26.18***     3.05***      1.44*      1.39         4.46*                3.86             0.94
                 (0.58)       (0.82)       (0.67)     (1.67)       (1.79)               (2.59)           (4.22)
TYPE3: SWLS      [illegible]  3.05**       1.44*      1.39         4.46**               3.86             0.94
                 (0.64)       (0.99)       (0.63)     (1.80)       (1.70)               (3.73)           (3.87)

a p-values: *, 0.05; **, 0.01; ***, 0.005.
b Standard errors are in parentheses under parameter estimates.


Table 8

Regression analysis output: When “on any diet” is the reference category for DIET variable in the model

Regression Type  Intercept    black        total.fat  not on any diet  on low-calorie diet  on low-fat diet  on low-carb diet
TYPE1: OLS       [illegible]  [illegible]  0.95       -3.03            [illegible]          [illegible]      -1.48
                 (2.00)       (0.70)       (0.72)     (1.94)           (2.03)               [illegible]      (3.66)
TYPE2: WLS       [illegible]  3.05***      1.44*      -1.39            4.46*                3.86             0.94
                 [illegible]  (0.82)       (0.67)     (1.67)           (1.79)               (2.59)           (4.22)
TYPE3: SWLS      [illegible]  3.05**       1.44*      -1.39            4.46**               3.86             0.94
                 (1.75)       (0.99)       (0.63)     (1.80)           (1.70)               (3.73)           (3.87)

a p-values: *, 0.05; **, 0.01; ***, 0.005.
b Standard errors are in parentheses under parameter estimates.

In Table 10, when “on any diet” is chosen as the reference category for variable DIET, the scaled condition indexes increase and show a moderate degree of collinearity (condition index larger than 10) between the on-diet dummy variables and the intercept. Using the table of scaled variance decomposition proportions, in OLS and WLS, the dummy variables for “not on any diet” and “on low-calorie diet” are involved in the dominant near-dependency with the intercept; however, in SWLS, only the dummy variable for “not on any diet” is involved in the dominant near-dependency with the intercept, and the other three on-diet variables are much less worrisome.


5. Conclusion 


Dependence between predictors in a linear regression 
model fitted with survey data affects the properties of 
parameter estimators. The problems are the same as for non- 
survey data: standard errors of slope estimators can be 
inflated and slope estimates can have illogical signs. In the 
extreme case when one column of the design matrix is 
exactly a linear combination of others, the estimating equa- 
tions cannot be solved. The more interesting cases are ones 
where predictors are related but the dependence is not exact. 
The collinearity diagnostics that are available in standard software routines are not entirely appropriate for survey data. Any diagnostic that involves variance estimation needs modification to account for sample features like stratification, clustering, and unequal weighting. This paper adapts condition numbers and variance decompositions, which can be used to identify cases of less than exact dependence, to be applicable for survey analysis.

A condition number of a survey-weighted design matrix $W^{1/2}X$ is the ratio of the maximum to the minimum eigenvalue of the matrix. The larger the condition number, the more nearly singular is $X^{T}WX$, the matrix which must be inverted when fitting a linear model. Large condition numbers are a symptom of some of the numerical problems associated with collinearity. The variance of an estimator of a regression parameter can also be written as a sum of terms that involve the eigenvalues of $W^{1/2}X$. The terms in the decomposition also involve “misspecification effects” if the model errors are not independent, as would be the case in a sample with clustering. The variance decompositions for different parameter estimators can be used to identify predictors that are correlated with each other. After identifying which predictors are collinear, an analyst can decide whether the collinearity has serious enough effects on a fitted model that action should be taken. The simplest step is to drop one or more predictors, refit the model, and observe how estimates change. The tools we provide here allow this to be done in a way appropriate for survey-weighted regression models.
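
The effect of the reference-level choice on the largest scaled condition index can be reproduced on synthetic data. The following Python sketch is an illustration only, not the analysis reported in this paper: the weights, the 15% on-diet rate and the stand-in continuous predictor are hypothetical, and the condition indexes are computed from the singular values of $W^{1/2}X$ after each column has been scaled to unit length.

```python
import numpy as np

def scaled_condition_indexes(X, w):
    """Condition indexes of W^{1/2} X with every column scaled to unit length."""
    Xw = X * np.sqrt(w)[:, None]              # multiply each row by sqrt(weight)
    Xs = Xw / np.linalg.norm(Xw, axis=0)      # unit-length columns
    sv = np.linalg.svd(Xs, compute_uv=False)  # singular values, largest first
    return sv[0] / sv                         # largest index = sv[0] / sv[-1]

rng = np.random.default_rng(0)
n = 2000
w = rng.uniform(1.0, 5.0, size=n)               # hypothetical survey weights
on_diet = (rng.random(n) < 0.15).astype(float)  # rare category, about 15% ones
fat = rng.normal(size=n)                        # stand-in continuous predictor

# Reference level "not on any diet": the dummy column is mostly zeros.
X_not_ref = np.column_stack([np.ones(n), on_diet, fat])
# Reference level "on any diet": the dummy column is mostly ones,
# so it nearly duplicates the intercept column.
X_on_ref = np.column_stack([np.ones(n), 1.0 - on_diet, fat])

ci_not_ref = scaled_condition_indexes(X_not_ref, w).max()
ci_on_ref = scaled_condition_indexes(X_on_ref, w).max()
print(ci_not_ref, ci_on_ref)
```

With a single dummy variable the indexes are smaller than those in Tables 9 and 10, but the ordering is the same: reversing the reference level raises the largest condition index.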


Table 9

Largest scaled condition indexes and their associated variance decomposition proportions: When “not on any diet” is the reference category for variable DIET in the model

              Scaled           Scaled Proportion of the Variance of
              Condition Index  Intercept  gender  total.fat  on any diet  on low-calorie diet  on low-fat diet  on low-carb diet
TYPE1: OLS    6                0.005      0.000   0.016      0.949        0.932                [illegible]      0.200
TYPE2: WLS    6                0.013      0.008   0.020      0.938        0.926                0.189            0.175
TYPE3: SWLS   6                0.006      0.007   0.013      0.686        0.741                0.027            0.061

Table 10

Largest scaled condition indexes and their associated variance decomposition proportions: When “on any diet” is the reference category for variable DIET in the model

              Scaled           Scaled Proportion of the Variance of
              Condition Index  Intercept  gender  total.fat  not on any diet  on low-calorie diet  on low-fat diet  on low-carb diet
TYPE1: OLS    [illegible]      0.982      0.001   0.034      0.968            0.831                [illegible]      0.186
TYPE2: WLS    17               0.982      0.011   0.029      0.968            0.820                0.182            0.160
TYPE3: SWLS   [illegible]      0.897      0.018   -0.006     [illegible]      0.318                0.014            -0.019


Bayesian inference for finite population quantiles from unequal probability samples


Qixuan Chen, Michael R. Elliott and Roderick J.A. Little


Abstract 


This paper develops two Bayesian methods for inference about finite population quantiles of continuous survey variables 
from unequal probability sampling. The first method estimates cumulative distribution functions of the continuous survey 
variable by fitting a number of probit penalized spline regression models on the inclusion probabilities. The finite population 
quantiles are then obtained by inverting the estimated distribution function. This method is quite computationally 
demanding. The second method predicts non-sampled values by assuming a smoothly-varying relationship between the 
continuous survey variable and the probability of inclusion, by modeling both the mean function and the variance function 
using splines. The two Bayesian spline-model-based estimators yield a desirable balance between robustness and efficiency. 
Simulation studies show that both methods yield smaller root mean squared errors than the sample-weighted estimator and 
the ratio and difference estimators described by Rao, Kovar, and Mantel (RKM 1990), and are more robust to model 
misspecification than the regression through the origin model-based estimator described in Chambers and Dunstan (1986). 
When the sample size is small, the 95% credible intervals of the two new methods have closer to nominal confidence 
coverage than the sample-weighted estimator. 


Key Words: Bayesian analysis; Cumulative distribution function; Heteroscedastic errors; Penalized spline regression; Survey samples.


1. Introduction 


We consider inference for finite population quantiles of a 
continuous variable from a sample survey with unequal in- 
clusion probabilities. The finite-population quantiles are 
usually estimated by the sample-weighted quantiles, a 
Horvitz-Thompson type estimator. Often in sample surveys 
the design variable (here, the inclusion probability) or a 
correlated auxiliary variable is measured on the non- 
sampled units, and this information can be used to improve 
the efficiency of the sample-weighted estimators (Zheng 
and Little 2003; Chen, Elliott, and Little 2010). 

Methods for using auxiliary information in estimating 
finite-population distribution functions have been exten- 
sively studied. Chambers and Dunstan (1986) proposed a 
model-based method, illustrating their approach for a zero 
intercept linear regression superpopulation model. We refer 
to this estimator from now on as the CD estimator. Dorfman 
and Hall (1993) applied the CD approach, replacing the 
linear regression model with a non-parametric model. 
Lombardia, Gonzalez-Manteiga, and Prada-Sanchez (2003, 
2004) proposed a bootstrap approximation to these esti- 
mators based on resampling a smoothed version of the 
empirical distribution of the residuals. Kuk and Welsh 
(2001) also modified the CD approach to address departures 
from the model by estimating the conditional distribution of 
residuals as a function of the auxiliary variable. Rao, Kovar, 
and Mantel (RKM 1990) demonstrated advantages of 


design-based ratio and difference estimators over the CD 
estimator when the model is misspecified. Wang and 
Dorfman (1996) suggested a weighted average of the CD 
and the RKM estimators. Kuk (1993) proposed a kernel- 
based estimator that combines the known distribution of the 
auxiliary variable with a kernel estimate of the conditional 
distribution of the survey variable given the value of the 
auxiliary variable. Chambers, Dorfman, and Wehrly (1993) 
proposed a kernel-smoothed model-based estimator, and 
Wu and Sitter (2001) and Harms and Duchesne (2006) 
proposed calibration type estimators. 

Research on using auxiliary information for inference 
about finite population quantiles (defined as the inverse of 
the distribution function) is more limited. Chambers and 
Dunstan (1986) discussed estimation by inverting the CD 
estimator of the distribution function, but did not compare 
the performance of this quantile estimator with alternatives. 
Rao et al. (1990) proposed simple ratio and difference
quantile estimators that were considerably more efficient 
than the sample-weighted estimator when the survey out- 
come was approximately proportional to the auxiliary 
variable. 

We assume here unequal probability sampling with inclusion probabilities that are known for all the units in the population. We develop two Bayesian spline-model-based estimators of finite population quantiles that incorporate the inclusion probabilities. The first method is to estimate the distribution function at a number of sample values using Bayesian penalized spline predictive estimators (Chen et al. 2010). The finite population quantiles are then estimated by inverting the predictive distribution function. The second method is a Bayesian two-moment penalized spline predictive estimator, which predicts the values of non-sampled units based on a normal model, with mean and variance both modeled with penalized splines on the inclusion probabilities. We compare the performance of these two new methods with the sample-weighted estimator, the CD estimator, and the RKM ratio and difference estimators, using simulation studies on artificially generated data and farm survey data.


2. Estimators of the quantiles 


Let $s$ denote an unequal probability random sample of size $n$, drawn from the finite population of $N$ identifiable units according to inclusion probabilities $\{\pi_i, i = 1, \ldots, N\}$, which are assumed to be known for all the units before a sample is drawn. Let $Y$ denote a continuous survey variable, with values $\{y_1, y_2, \ldots, y_n\}$ observed in the random sample $s$. The finite-population $\alpha$-quantile of $Y$ is defined as:

$$\theta(\alpha) = \inf\left\{t;\ N^{-1}\sum_{i=1}^{N}\Delta(t - y_i) \ge \alpha\right\}, \qquad (1)$$

where $\Delta(u) = 1$ when $u \ge 0$ and $\Delta(u) = 0$ elsewhere. The quantile $\theta(\alpha)$ is often estimated using the sample-weighted $\alpha$-quantile $\hat{\theta}_w(\alpha) = \inf\{t;\ \hat{F}_w(t) \ge \alpha\}$, where $\hat{F}_w(t)$ is the sample-weighted distribution function given by

$$\hat{F}_w(t) = \frac{\sum_{i \in s}\pi_i^{-1}\Delta(t - y_i)}{\sum_{i \in s}\pi_i^{-1}}.$$
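
A minimal Python sketch of the sample-weighted quantile follows, assuming the weighted distribution function uses weights equal to the inverse inclusion probabilities and is normalized by their sum. The function name and the toy data are ours and serve only as an illustration.

```python
import numpy as np

def weighted_quantile(y, pi, alpha):
    """Smallest sampled value t whose weighted CDF is at least alpha."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    order = np.argsort(y)
    y_sorted = y[order]
    w = 1.0 / pi[order]                    # sampling weights 1 / pi_i
    cdf = np.cumsum(w) / w.sum()           # weighted CDF at each sorted y
    idx = np.searchsorted(cdf, alpha, side="left")  # first cdf >= alpha
    idx = min(idx, len(y_sorted) - 1)      # guard against round-off at alpha = 1
    return y_sorted[idx]

# Toy sample: the two largest units carry twice the weight of the others.
y = [1.0, 2.0, 3.0, 4.0]
pi = [0.5, 0.5, 0.25, 0.25]
median_w = weighted_quantile(y, pi, 0.5)
print(median_w)
```

Here the weighted CDF at the sorted values is 1/6, 1/3, 2/3 and 1, so the weighted median is the third value.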

Woodruff (1952) proposed a method of calculating confidence limits for the sample-weighted $\alpha$-quantile. First, a pseudo-population is obtained by weighting each sample item by its sampling weight. Next, the standard deviation of the percentage of items less than the estimated $\alpha$-quantile is estimated; the estimated standard deviation is multiplied by the appropriate $z$ percentile and is added to and subtracted from $\alpha$ to construct the confidence limits for the percentage of items less than the estimated $\alpha$-quantile. Finally, the values of the survey variable corresponding to these confidence limits are read off the weighted pseudo-population arrayed in order of size. Variance estimation of the percentage of items in the pseudo-population less than the estimated $\alpha$-quantile is discussed in Woodruff (1952). Sitter and Wu (2001) showed that the Woodruff intervals perform well even in moderate to extreme tail regions of the distribution function. An alternative variance estimate was derived by Francisco and Fuller (1991) using a smoothed version of the large-sample test inversion.


2.1 Bayesian model-based approach, inverting the estimated CDF


The finite population quantile function is the inverse of the finite population cumulative distribution function (CDF), defined as $F(t) = N^{-1}\sum_{i=1}^{N}\Delta(t - y_i)$, where $\Delta(x) = 1$ when $x \ge 0$ and $\Delta(x) = 0$ elsewhere. We can estimate the finite population quantiles by first building a continuous and strictly monotonic predictive estimate of $F(t)$, treating $\Delta(t - y)$ as a binary outcome variable and applying methods for estimating finite population proportions.

In particular, Chen et al. (2010) proposed a Bayesian penalized spline predictive (BPSP) estimator for finite population proportions in unequal probability sampling. They regress the binary survey variable $z$ on the inclusion probabilities in the sample, using the following probit penalized spline regression model (2) with $m$ pre-selected fixed knots:

$$\Phi^{-1}(E(z_i \mid \beta, b, \pi_i)) = \beta_0 + \sum_{k=1}^{p}\beta_k\pi_i^{k} + \sum_{l=1}^{m} b_l(\pi_i - k_l)_+^{p}, \qquad b_l \sim N(0, \tau^2). \qquad (2)$$

Self-representing units are included by setting $\pi_i = 1$. Assuming non-informative prior distributions for $\beta$ and $\tau^2$, they simulated draws of $z$ for the non-sampled units from their posterior predictive distribution. A draw from the posterior distribution of the finite population proportion is then obtained by averaging the observed sample units and the draws of the non-sampled units. This is repeated many times to simulate the posterior distribution of the finite population proportion. Simulation studies indicated that the BPSP estimator is more efficient than the sample-weighted and generalized regression estimators of the finite population proportion, with confidence coverage closer to nominal levels.

We employ the BPSP approach $n$ times to estimate $F(t)$ at each of the sampled values of $y$, $t \in \{y_1, y_2, \ldots, y_n\}$. This estimator does not take into account the fact that we are estimating a whole distribution function, and is not necessarily a monotonic function. In addition, linear interpolation of the $n$ estimated distribution functions may lead to a poorly-estimated CDF. To overcome these two problems, we fit a smooth cubic regression curve to the $n$ estimated distribution functions with monotonicity constraints (Wood 1994). We denote the resulting estimated distribution function as $\hat{F}(t)$. The Bayesian model-based estimator of $\theta(\alpha)$, obtained by inverting the estimated CDF, is then defined as follows:

$$\hat{\theta}_{\text{inv-CDF}}(\alpha) = \inf\{t;\ \hat{F}(t) \ge \alpha\}. \qquad (3)$$

We also fit two other monotonic smooth regression curves to the upper and lower limits of the 95% credible intervals (CI) of these estimated distribution functions, denoted as $\hat{F}_U(t)$ and $\hat{F}_L(t)$. To reduce computation time in our simulation studies, we only estimate the CDF at $k < n$ pre-selected sample points.
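
The inversion step can be sketched in Python as follows. This is a simplified stand-in and not the procedure used in this paper: monotonicity is imposed with a running maximum and the curve is linearly interpolated, instead of fitting the monotone smooth cubic regression of Wood (1994). The pointwise estimates and credible limits below are made-up numbers, and the function name is ours. Applying the same inversion to the upper and lower curves gives two further values that bracket the point estimate.

```python
import numpy as np

def invert_cdf(t_grid, cdf_est, alpha):
    """Invert pointwise CDF estimates after forcing them to be nondecreasing."""
    t_grid = np.asarray(t_grid, dtype=float)
    mono = np.maximum.accumulate(np.clip(cdf_est, 0.0, 1.0))
    if alpha <= mono[0]:
        return t_grid[0]
    if alpha > mono[-1]:
        return t_grid[-1]
    j = int(np.argmax(mono >= alpha))      # first grid point with CDF >= alpha
    # Linear interpolation between grid points j - 1 and j.
    frac = (alpha - mono[j - 1]) / (mono[j] - mono[j - 1])
    return t_grid[j - 1] + frac * (t_grid[j] - t_grid[j - 1])

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])            # selected sample values
point = np.array([0.10, 0.30, 0.28, 0.70, 0.95])   # hypothetical estimates of F(t)
lower = point - 0.08                               # hypothetical lower 95% limits
upper = point + 0.08                               # hypothetical upper 95% limits

alpha = 0.5
x_hat = invert_cdf(t, point, alpha)   # point estimate of the quantile
x_low = invert_cdf(t, upper, alpha)   # upper CDF curve gives the lower endpoint
x_high = invert_cdf(t, lower, alpha)  # lower CDF curve gives the upper endpoint
print(x_low, x_hat, x_high)
```

Note that the third pointwise estimate (0.28) is smaller than the second (0.30); the running maximum removes this violation before the curve is inverted.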

The basic idea behind this approach is shown graphically in Figure 1. Suppose a sample of size 100 is drawn from a finite population. We pick 20 observations from the sample and estimate their corresponding distribution functions and associated 95% CI using the BPSP estimator. In Figure 1(a) we plot the BPSP estimates of these 20 points with black dots and the upper and lower limits of the 95% CI with “-” signs, and connect the upper and lower limits with solid lines. In Figure 1(b) we add three monotonic smooth predictive curves, using a solid black curve for the point estimate and dashed black curves for the upper and lower limits of the 95% CI.

In Figure 1(c) we draw a horizontal line across the graph with $\alpha$ as the y-axis value. We read $x_A$, $x$ and $x_B$, respectively, from the x-axis such that $\hat{F}_L(x_A) = \alpha$, $\hat{F}(x) = \alpha$, and $\hat{F}_U(x_B) = \alpha$. Then $x$ is the inverse-CDF Bayesian estimate of $\theta(\alpha)$. If the 95% CI of the distribution function $F(\cdot)$ is formed by splitting the tail areas of the posterior distribution equally, the interval formed by $x_B$ and $x_A$ is a 95% CI of $\theta(\alpha)$. The proof is as follows: If $\alpha$ is the lower limit of the 95% CI of $F(x_A)$, only 2.5 percent of the draws of $F(x_A)$ in the posterior distribution are smaller than $\alpha$. That is,

$$\Pr(F^{-1}(\alpha) > F^{-1}(F(x_A))) = \Pr(\theta(\alpha) > x_A) = 0.025.$$

Similarly, with $\alpha$ as the upper limit of the 95% CI of $F(x_B)$, $\Pr(\theta(\alpha) > x_B) = 0.975$. Therefore, there is a 95% probability that $\theta(\alpha)$ is within $x_B$ and $x_A$ in the posterior distribution, given the sample.

[Figure 1, panels (a) and (b): cumulative distribution function plotted against the value of Y]

This inverse-CDF Bayesian model-based approach avoids strong modeling assumptions, and can be applied to normal or skewed distributions. Estimating the distribution function at all $n$ sample units makes full use of the sample information, but is computationally intensive; estimating the distribution function at $k < n$ values reduces computation time at the expense of some loss of efficiency. In the traditional approach, the population quantiles are estimated by inverting the unsmoothed empirical CDF. We recommend fitting a smooth cubic regression curve to the estimated distribution functions before inverting the estimated CDF. The resulting quantile estimates are more efficient, because the smooth curve exploits information from all the data. Simulations not shown here suggest that the CDF curve estimated from a well-chosen subset of $k$ sample units is similar to the curve estimated from all sample units, but the computation time is significantly reduced.

We suggest choosing the subset of $k$ data points at evenly spaced intervals in the middle of the distribution, and at more frequent intervals in the extremes to improve the estimate of the CDF in the tails. For instance, in our simulation study with a sample size of 100, we estimated the distribution functions at 20 points: the 3 smallest, the 3 largest, and 14 other equally spaced points in the middle of the ordered sample.
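
One way to pick such a subset is sketched below in Python. We assume here that “equally spaced” refers to the ranks of the ordered sample rather than to the values of $Y$; this interpretation, the function name and its arguments are ours.

```python
import numpy as np

def choose_points(y, n_tail=3, n_middle=14):
    """Return the n_tail smallest and largest values plus n_middle values
    at (approximately) equally spaced ranks in between."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = len(y_sorted)
    # Equally spaced ranks strictly between the two tail blocks.
    middle = np.linspace(n_tail - 1, n - n_tail, n_middle + 2)[1:-1]
    idx = np.concatenate([
        np.arange(n_tail),                 # ranks of the smallest values
        np.round(middle).astype(int),      # interior ranks
        np.arange(n - n_tail, n),          # ranks of the largest values
    ])
    return y_sorted[np.unique(idx)]

# A sample of size 100 (values 0, ..., 99 in arbitrary order).
sample = np.arange(100.0)[::-1]
points = choose_points(sample)
print(len(points), points)
```

For a sample of size 100 this returns 20 distinct points, matching the configuration described above.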


[Figure 1, panel (c): cumulative distribution function plotted against the value of Y, with x(B), x and x(A) marked on the horizontal axis]

Figure 1 Inverse-CDF Bayesian model-based approach in estimating finite population distribution functions and associated quantiles, illustrated using a sample of size 100 drawn from a finite population. (a) The BPSP method is used to estimate the finite population distribution functions at 20 sample points; the dots denote BPSP estimators and the minus signs denote the upper and lower limits of the 95% CI. (b) Three monotonic smooth cubic regression models are fit on the BPSP estimators, upper limits, and lower limits; the solid curve is the predictive continuous distribution function and the two dashed curves are the 95% CI of the distribution function. (c) The point estimate and 95% CI of the population $\alpha$-quantile are obtained by inverting the estimated CDF; x is the point estimate, and x(B) and x(A) are the lower and upper limits of the 95% CI.


2.2 Bayesian two-moment penalized spline predictive approach


We consider alternative estimators of finite population quantiles of the form:

$$\hat{\theta}(\alpha) = \inf\left\{t;\ N^{-1}\left(\sum_{i \in s}\Delta(t - y_i) + \sum_{j \notin s}\Delta(t - \hat{y}_j)\right) \ge \alpha\right\}, \qquad (4)$$

where $\hat{y}_j$ is the predicted value of the $j$-th non-sampled unit based on a regression on the inclusion probabilities $\{\pi_i\}$. A basic normal model for a continuous outcome assumes a mean function that is linear in $\{\pi_i\}$, that is:

$$Y_i \overset{ind}{\sim} N(\beta_0 + \beta_1\pi_i,\ c_i\sigma^2), \qquad (5)$$

with known constants $c_i$ to model non-constant variance. This leads to a biased estimate of $\theta(\alpha)$ when the relationship is not linear. For estimating finite population totals, Zheng and Little (2003, 2005) replaced the linear mean function in (5) with a penalized spline, and assumed $c_i = \pi_i^{2k}$ with some known value of $k$. Simulations suggested that their model-based estimator of the finite population total outperforms the sample-weighted estimator, even when the variance structure is misspecified.

For estimation of quantiles rather than the total, correct 
specification of the variance structure is important in order 
to avoid bias. Therefore, we extend the penalized spline 
model in Zheng and Little (2003) by modeling both the 
mean and the variance using penalized splines. The two- 
moment penalized spline model can be written as (Ruppert, 
Wand, and Carroll 2003, page 264): 

$$Y_i \overset{ind}{\sim} N(\text{SPL}_1(\pi_i, k),\ \exp(\text{SPL}_2(\pi_i, k'))),$$

$$\text{SPL}_1(\pi_i, k) = \beta_0 + \sum_{k=1}^{p}\beta_k\pi_i^{k} + \sum_{l=1}^{m_1} b_l(\pi_i - k_l)_+^{p}, \qquad b_l \overset{iid}{\sim} N(0, \tau_b^2),$$

$$\text{SPL}_2(\pi_i, k') = \alpha_0 + \sum_{k=1}^{p}\alpha_k\pi_i^{k} + \sum_{l=1}^{m_2} v_l(\pi_i - k'_l)_+^{p}, \qquad v_l \overset{iid}{\sim} N(0, \tau_v^2). \qquad (6)$$

In (6), the mean and the logarithm of the variance are modeled as penalized splines $\text{SPL}_1$ and $\text{SPL}_2$ on $\{\pi_i\}$. Modeling the logarithm of the variance ensures positive estimates of the variance. We allow different numbers $(m_1, m_2)$ and locations $(k, k')$ of the knots for the two splines.

Ruppert et al. (2003) suggested an iterative approach to estimate the parameters in (6). They first assumed that $\text{SPL}_2$ was known and fitted a linear mixed model to estimate the parameters in $\text{SPL}_1$. They calculated the square of the difference between $Y$ and $\text{SPL}_1$, which followed a Gamma distribution with shape parameter 1/2 and scale parameter $2\exp(\text{SPL}_2)$. They then fitted a generalized linear mixed model for the squared differences to estimate the parameters in $\text{SPL}_2$. They iterated the above procedures until the parameter estimates converged. This iterative approach is simple to implement. However, our goal here is not to estimate the parameters but to obtain Bayesian predictions of $Y$ for the non-sampled units so that we can use (4) to estimate the quantiles.
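
The truncated-polynomial form of the two splines in (6) can be made concrete with a short Python sketch. It only builds the design matrices and evaluates the mean and variance for given coefficients; it does not fit the model. The coefficient values, the grid of inclusion probabilities and the placement of 15 equally spaced interior knots are hypothetical choices made for illustration.

```python
import numpy as np

def truncated_power_basis(pi, knots, degree=3):
    """Polynomial part (1, pi, ..., pi^p) and truncated part (pi - k_l)_+^p."""
    pi = np.asarray(pi, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(pi, degree + 1, increasing=True)
    trunc = np.clip(pi[:, None] - knots[None, :], 0.0, None) ** degree
    return poly, trunc

pi = np.linspace(0.01, 0.15, 50)                      # hypothetical inclusion probabilities
knots = np.linspace(pi.min(), pi.max(), 17)[1:-1]     # 15 equally spaced interior knots
poly, trunc = truncated_power_basis(pi, knots)

# Hypothetical coefficients: fixed effects (beta, alpha) and
# random spline effects (b, v) for the mean and log-variance splines.
rng = np.random.default_rng(1)
beta = np.array([0.5, 1.0, 0.0, 0.0])
alpha_coef = np.array([-3.0, 5.0, 0.0, 0.0])
b = rng.normal(scale=0.1, size=len(knots))
v = rng.normal(scale=0.1, size=len(knots))

mean = poly @ beta + trunc @ b                   # SPL_1(pi, k)
variance = np.exp(poly @ alpha_coef + trunc @ v) # exp(SPL_2(pi, k')), always positive
```

Because the variance is the exponential of the second spline, it is positive for any coefficient values, which is the point made in the text about modeling the logarithm of the variance.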

Crainiceanu, Ruppert, Carroll, Joshi, and Goodner (2007) developed Bayesian inferential methodology for (6). They noted that the implementation of MCMC using multivariate Metropolis-Hastings steps is unstable, with poor mixing properties. They suggested adding error terms to the second spline to make computations feasible, replacing sampling from complex full conditionals by simple univariate Metropolis-Hastings steps. This idea can be expressed as

$$Y_i \overset{ind}{\sim} N(\text{SPL}_1(\pi_i, k),\ \sigma_\varepsilon^2(\pi_i)),$$

$$\log(\sigma_\varepsilon^2(\pi_i)) \overset{iid}{\sim} N(\text{SPL}_2(\pi_i, k'),\ \sigma_A^2).$$

We used a prior distribution $N(0, 10^6)$ for the fixed effects parameters $\beta$ and $\alpha$, and a proper inverse-gamma prior distribution $\text{IGamma}(10^{-6}, 10^{-6})$ for the variance components $\tau_b^2$ and $\tau_v^2$. We fixed the value of $\sigma_A^2 = 0.1$. The full conditionals of the posterior are detailed in Crainiceanu et al. (2007).

The posterior distribution of the finite population $\alpha$-quantile is simulated by generating a large number $D$ of draws and using the predictive estimator form

$$\hat{\theta}^{(d)}(\alpha) = \inf\left\{t;\ N^{-1}\left(\sum_{i \in s}\Delta(t - y_i) + \sum_{j \notin s}\Delta(t - \hat{y}_j^{(d)})\right) \ge \alpha\right\},$$

where $\hat{y}_j^{(d)}$ is a draw from the posterior predictive distribution of the $j$-th non-sampled unit of the continuous outcome. The average of these draws simulates the Bayesian two-moment penalized spline predictive (B2PSP) estimator of the finite population $\alpha$-quantile,

$$\hat{\theta}_{\text{B2PSP}}(\alpha) = D^{-1}\sum_{d=1}^{D}\hat{\theta}^{(d)}(\alpha).$$
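
Given posterior predictive draws for the non-sampled units, the summary step can be sketched in Python as below. The sketch takes the draws as given (here a tiny hypothetical array rather than MCMC output), so it illustrates only how the draws are turned into a point estimate and an equal-tailed interval; the function names are ours.

```python
import numpy as np

def population_quantile(values, alpha):
    """inf{t : N^{-1} * #(values <= t) >= alpha} for a complete population."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.ceil(alpha * len(v) - 1e-12))   # smallest k with k / N >= alpha
    return v[max(k, 1) - 1]

def b2psp_summary(y_sampled, nonsampled_draws, alpha, level=0.95):
    """Average of the per-draw quantiles and an equal-tailed credible interval."""
    draws = np.array([
        population_quantile(np.concatenate([y_sampled, d]), alpha)
        for d in nonsampled_draws              # one row per posterior draw d
    ])
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return draws.mean(), (lo, hi)

y_s = np.array([1.0, 2.0])                     # observed sample values
draws = np.array([[3.0, 4.0],                  # draw 1 for the two non-sampled units
                  [5.0, 6.0]])                 # draw 2
est, (lo, hi) = b2psp_summary(y_s, draws, alpha=0.75)
print(est, lo, hi)
```

In this toy case the completed populations are {1, 2, 3, 4} and {1, 2, 5, 6}, whose 0.75-quantiles are 3 and 5, so the averaged estimate is 4.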


The Bayesian 95% credible interval for the population $\alpha$-quantile in the simulations is formed by splitting the tail area equally between the upper and lower endpoints.


3. Simulation study


3.1 Simulation study with artificial data 


We first simulated a super-population of size $M = 20{,}000$. The size variable $X$ in the super-population takes 20,000 consecutive integer values from 710 to 20,709. A finite population of size $N = 2{,}000$ was then selected from the super-population using systematic probability proportional to size (pps) sampling with the probability proportional to the inverse of the size variable. Consequently, the size variable in the finite population has a right-skewed distribution. The survey outcome $Y$ was drawn from a normal distribution with mean $f(\pi)$ and error variance equal to 0.04 (homoscedastic error) or $\pi$ (heteroscedastic error). Three different mean structures $f(\pi)$ were simulated: no association between $Y$ and $\pi$ (NULL), $f(\pi) = 0.5$; a linear association (LINUP), $f(\pi) = 6\pi$; and a nonlinear association (EXP), $f(\pi) = \exp(-4.64 + 52\pi)$. For each of the six simulation conditions, one thousand replicate finite populations were generated, and a systematic pps sample ($n = 100$) was drawn from each population with $x$ as the size variable; thus $\pi_i = n x_i / \sum_{i=1}^{N} x_i$. Scatter plots of $Y$ versus $\pi$ for these six populations are displayed in Figure 2.
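
The data-generating steps can be sketched in Python for one replicate of the LINUP population with homoscedastic errors. This is our own illustration under stated assumptions, not the code used for the study: the systematic pps routine is a standard cumulative-total implementation applied to the list in increasing order of size, and the seed and function name are arbitrary.

```python
import numpy as np

def systematic_pps(size, n, rng):
    """Systematic sample of n indices with probability proportional to size."""
    size = np.asarray(size, dtype=float)
    cum = np.cumsum(size)
    step = cum[-1] / n
    points = rng.uniform(0.0, step) + step * np.arange(n)  # random start, fixed step
    idx = np.searchsorted(cum, points, side="right")
    return np.minimum(idx, len(size) - 1)                  # guard against round-off

rng = np.random.default_rng(2012)

# Super-population: 20,000 consecutive integer sizes from 710 to 20,709.
x_super = np.arange(710, 20710)

# Finite population of 2,000 units, selected with probability proportional to 1 / x.
pop_idx = systematic_pps(1.0 / x_super, 2000, rng)
x = x_super[pop_idx].astype(float)

# Inclusion probabilities for a pps sample of size n with x as the size variable.
n = 100
pi = n * x / x.sum()

# LINUP mean structure with homoscedastic error variance 0.04.
y = rng.normal(loc=6.0 * pi, scale=np.sqrt(0.04))

# Systematic pps sample of size n from the finite population.
sample_idx = systematic_pps(x, n, rng)
y_s, pi_s = y[sample_idx], pi[sample_idx]
```

Because selection into the finite population favors small sizes, the median of `x` falls below its mean, which is the right skew noted in the text.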

We compared the performance of the Bayesian inverse- 
CDF and the B2PSP estimators with five alternative ap- 
proaches: 


a) SW, the sample-weighted estimator defined by inverting $\hat{F}_w$.

b) Smooth-SW, the smooth sample-weighted estimator. A smooth cubic regression curve was fit to $\hat{F}_w$ and denoted as $\tilde{F}_w$. The smooth sample-weighted estimator is then defined as $\tilde{\theta}_w(\alpha) = \inf\{t;\ \tilde{F}_w(t) \ge \alpha\}$.

c) CD, the Chambers and Dunstan estimator (1986), by assuming the following model: $Y_i = \beta\pi_i + \sqrt{\pi_i}\,U_i$, where $U_i$ is an independent and identically distributed random variable with zero mean.

d) Ratio, the RKM ratio estimator (1990) given by $\{\hat{\theta}_y(\alpha)/\hat{\theta}_x(\alpha)\} \times \theta_x(\alpha)$, where $\hat{\theta}_y(\alpha)$ and $\hat{\theta}_x(\alpha)$ denote respectively the sample-weighted estimates for $Y$ and the size variable $X$, and $\theta_x(\alpha)$ is the known population quantile of $X$.

e) Diff, the RKM difference estimator (1990) given by $\hat{\theta}_y(\alpha) + \hat{R} \times \{\theta_x(\alpha) - \hat{\theta}_x(\alpha)\}$, where $\hat{R}$ is the sample-weighted estimate of $Y/X$.

The seven estimators for the finite-population 10th, 25th, 50th, 75th, and 90th percentiles were compared in terms of empirical bias and root mean squared error (RMSE). Because of the complexity in the variance estimation for the CD and RKM estimators, we only compared the average width and the non-coverage rate of the 95% confidence/credible interval (CI) for the two Bayesian model-based estimators and the sample-weighted estimator. For the 95% CI, we used Woodruff's method for the sample-weighted estimator, the method illustrated in Figure 1(c) for the inverse-CDF Bayesian estimator, and the 95% posterior probability of the quantile with equal tails for the B2PSP estimator. We used cubic splines with 15 equally spaced knots.

Tables 1 and 2 show the empirical bias and RMSE for 
the three normal distributions with homoscedastic errors and 
with heteroscedastic errors, respectively. Overall, the 
empirical bias in estimating the five quantiles is similar 
using the two Bayesian estimators, the two sample-weighted 
estimators, and the RKM’s two design-based estimators. In 
contrast, the CD estimator has large bias and RMSE in all 
scenarios except for LINUP with heteroscedastic error, 
where its underlying model is correctly specified. The two 
Bayesian model-based estimators yield smaller root mean 
squared errors than the other estimators, and this improvement
in efficiency is substantial in some scenarios,
especially using the B2PSP estimator. By applying a smooth 
cubic regression curve on the estimated empirical sample- 
weighted CDF, the smooth-sample-weighted estimator 
gains some efficiency over the conventional sample- 
weighted estimators, but the RMSE is still larger than the 
Bayesian Inverse-CDF estimator. Comparisons of the three 
design-based estimators suggest that none of the three 
estimators uniformly dominates the other two. Specifically, 
the sample-weighted estimator has smaller RMSE than the 
RKM difference and ratio estimators for all five quantiles in 
the NULL and for the lower quantiles in the LINUP and 
EXP populations; on the other hand, the RKM estimators 
have smaller RMSE at the upper quantiles in the LINUP 
and EXP populations. 

Table 3 shows the average width and non-coverage rate of the 95% CI for the two Bayesian model-based estimators and the sample-weighted estimator. Overall, the two Bayesian model-based estimators yield shorter average 95% CI widths than the sample-weighted estimator. The coverage rate of the 95% CI is similar among the three estimators, except when $\alpha$ is equal to 0.1, where the 95% CI of the B2PSP estimator has the shortest average width and very good coverage, while the sample-weighted estimator has serious under-coverage. This happens because the Woodruff method for estimating the variance of the sample-weighted estimator is based on a large sample assumption, but here the pps sampling leads to only a small number of cases being sampled in the lower tail.

[Figure 2: scatter-plot panels including NULL + homoscedasticity, NULL + heteroscedasticity, LINUP + heteroscedasticity and EXP + heteroscedasticity; horizontal axis: inclusion probability]

Figure 2 Scatter plots of Y versus the inclusion probabilities for the six artificial finite populations of size equal to 2,000


Although the sample-weighted estimator performs similarly to the two Bayesian spline-model-based estimators in terms of overall empirical bias, the conditional bias of its estimates varies considerably as the sample mean of the inclusion probabilities increases. Following Royall and Cumberland (1981), the estimates from the 1,000 samples were ordered according to the sample mean of the inclusion probabilities and were split into 20 groups of 50 each, and then the empirical bias was calculated for each group. Figure 3 displays the conditional bias of the two Bayesian estimators and the sample-weighted estimator for the 90th percentile in the “EXP + homoscedastic error” case. Figure 3 shows that there is a linear trend for the bias in the sample-weighted estimator as the sample mean of the inclusion probabilities increases, while the grouped bias of the two Bayesian spline-model-based estimators is less affected by the sample mean of inclusion probabilities. Similar findings are also seen in other scenarios. 
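
The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `grouped_bias`, its arguments and the toy numbers are hypothetical and are not taken from the paper, which reports results for 1,000 replicates split into 20 groups of 50.

```python
from statistics import mean


def grouped_bias(estimates, mean_probs, truth, n_groups=20):
    """Order replicates by the sample mean of the inclusion probabilities,
    split them into equal-sized groups, and return one
    (group mean inclusion probability, group empirical bias) pair per group."""
    if len(estimates) != len(mean_probs):
        raise ValueError("one mean inclusion probability is needed per estimate")
    if len(estimates) % n_groups != 0:
        raise ValueError("number of replicates must be divisible by n_groups")
    # Replicate indices sorted by the sample mean of the inclusion probabilities.
    order = sorted(range(len(estimates)), key=lambda s: mean_probs[s])
    size = len(order) // n_groups
    groups = []
    for g in range(n_groups):
        members = order[g * size:(g + 1) * size]
        group_prob = mean(mean_probs[s] for s in members)
        group_bias = mean(estimates[s] for s in members) - truth
        groups.append((group_prob, group_bias))
    return groups


# Toy input (hypothetical numbers): 1,000 replicates whose estimates drift
# linearly with the sample mean of the inclusion probabilities.
toy_probs = [0.04 + 0.00002 * s for s in range(1000)]
toy_estimates = [1.0 + 2.0 * (p - 0.05) for p in toy_probs]
toy_groups = grouped_bias(toy_estimates, toy_probs, truth=1.0)
```

Plotting the second element of each pair against the first gives a display of the kind shown in Figure 3; a roughly flat line indicates an estimator whose bias does not depend on the realized sample.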
Table 1 
Comparisons of empirical bias and root mean squared errors × 10^3 of θ(α) for α = 0.1, 0.25, 0.5, 0.75, and 0.9: Scenarios with homoscedastic errors 

                 Empirical RMSE 
                 0.1    0.25   0.5 
NULL 
Inverse-CDF      46     37     36 
B2PSP            41     33     31 
SW               54     41     39 
Smooth-SW        50     39     37 
CD               203    274    266 
RKM’s Ratio      41     125    159 
RKM’s Diff       58     58     94 
LINUP 
Inverse-CDF      70     49     39 
B2PSP            56     43     35 
SW               [?]    57     48 
Smooth-SW        [?]    53     45 
CD               104    38     39 
RKM’s Ratio      95     67     53 
RKM’s Diff       17     55     45 
EXP 
Inverse-CDF      60     45     41 
B2PSP            52     40     35 
SW               65     49     46 
Smooth-SW        62     47     43 
CD               96     57     21 
RKM’s Ratio      87     65     50 
RKM’s Diff       65     49     47 

[The empirical bias columns and the RMSE columns for α = 0.75 and 0.9 are not legible in the source.] 


Table 2 
Comparisons of empirical bias and root mean squared errors × 10^3 of θ(α) for α = 0.1, 0.25, 0.5, 0.75, and 0.9: Scenarios with heteroscedastic errors 
Table 3 
Comparisons of average width and non-coverage rate of 95% CI × 10^3 of θ(α) for α = 0.1, 0.25, 0.5, 0.75, and 0.9 

               Average width of 95% CI          Non-coverage rate of 95% CI 
               0.1   0.25  0.5   0.75  0.9      0.1   0.25  0.5   0.75  0.9 
Homoscedastic errors 
NULL 
Inverse-CDF    199   156   141   [?]   184      46    35    44    38    67 
B2PSP          178   134   118   134   [?]      52    [?]   61    [?]   50 
SW             195   164   151   167   [?]      [?]   65    46    40    38 
LINUP 
Inverse-CDF    [?]   207   157   139   141      61    45    [?]   46    52 
B2PSP          230   167   134   [?]   121      58    54    44    [?]   57 
SW             248   231   188   179   187      119   60    42    41    39 
EXP 
Inverse-CDF    234   184   163   [?]   234      59    44    47    40    42 
B2PSP          [?]   [?]   132   144   156      54    39    55    [?]   60 
SW             231   199   175   210   402      106   64    47    40    40 
Heteroscedastic errors 
NULL 
Inverse-CDF    146   104   90    101   137      42    43    38    38    47 
B2PSP          107   89    [?]   89    107      38    49    [?]   68    65 
SW             146   101   91    113   169      80    60    51    37    42 
LINUP 
Inverse-CDF    131   107   104   124   154      70    [?]   36    42    40 
B2PSP          125   [?]   87    93    116      47    [?]   50    58    52 
SW             141   110   133   184   219      138   69    41    50    42 
EXP 
Inverse-CDF    [?]   [?]   99    134   242      63    49    34    40    41 
B2PSP          116   92    84    98    139      [?]   55    40    63    59 
SW             135   100   106   186   378      111   65    46    45    34 


Figure 3. Variation of empirical bias of the three estimators for the 90th percentile from the “EXP + homoscedasticity” case (vertical axis: group empirical bias) 
3.2 Simulation study with the broadacre farm survey data 


The B2PSP estimator assumes the outcome has a normal 
distribution, after conditioning on the inclusion proba- 
bilities. Since the inverse-CDF Bayesian model-based ap- 
proach does not assume normality, we might expect it to 
out-perform the B2PSP when the normality assumption is 
violated. This motivates a comparison of the sample- 
weighted and the inverse-CDF Bayesian estimators for non- 
normal data. 

The population considered here is defined by 398 broadacre farms (farms involved in the production of cereal crops, beef, sheep and wool) with 6,000 or fewer hectares that participated in the 1982 Australian Agricultural and Grazing Industries Survey carried out by the Australian Bureau of Agricultural and Resource Economics (ABARE 2003). The Y variable is the total farm cash receipts. One thousand systematic pps samples of size 100 were drawn with the farm area, X, as the size variable; that is, larger farms are more likely to be selected into the sample. Figure 4 is the scatter plot of Y versus the size variable X for these farms, with filled circles representing a selected pps sample. This shows that the variation of Y increases as X increases. Moreover, Y is right-skewed given X. A simulation study using these broadacre farm data was conducted to compare the two Bayesian spline-model-based estimators with the sample-weighted estimator. 
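
For readers unfamiliar with systematic pps sampling, the following Python sketch shows one common way to draw such a sample from a list of size measures. It is a minimal illustration, not the authors' code: the function names, the toy size measures and the check that no unit has an inclusion probability above 1 are all assumptions made here.

```python
import random
from itertools import accumulate


def inclusion_probabilities(sizes, n):
    """Inclusion probability n * x_i / sum(x) for each unit under pps sampling."""
    total = sum(sizes)
    return [n * x / total for x in sizes]


def systematic_pps(sizes, n, rng=random):
    """Draw a systematic pps sample of n unit indices, in list order.

    Units are laid end to end on a line with lengths equal to their sizes;
    a random start is taken in the first sampling interval and every
    interval-th point along the line selects the unit it falls in."""
    cum = list(accumulate(sizes))
    total = cum[-1]
    step = total / n
    if max(sizes) > step:
        raise ValueError("a unit would have inclusion probability above 1")
    start = rng.random() * step
    sample = []
    unit = 0
    for j in range(n):
        point = start + j * step
        # Advance to the unit whose cumulative-size interval contains the point.
        while unit < len(cum) - 1 and cum[unit] <= point:
            unit += 1
        sample.append(unit)
    return sample


# Toy population (hypothetical size measures, not the farm data):
toy_sizes = [float(x) for x in range(1, 399)]   # 398 units
toy_sample = systematic_pps(toy_sizes, 100, random.Random(1))
```

Repeating the draw with different random starts, as in the 1,000 replicates described above, gives the sampling distribution against which the estimators are compared.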

Table 4 shows the simulation results. In general, the inverse-CDF Bayesian approach yields smaller empirical bias and RMSE, and a shorter average 95% CI width, than the sample-weighted estimator. The 95% CI of the inverse-CDF Bayesian approach also has closer to nominal confidence coverage than that of the sample-weighted estimator when α is 0.1 and 0.25. However, in the upper tail with α = 0.90, the non-coverage rate of the inverse-CDF Bayesian approach is higher than the nominal level 0.05, while the Woodruff CI of the sample-weighted estimator does well. This is consistent with the findings of Sitter and Wu (2001) that the Woodruff intervals perform well even in the moderate to extreme tail regions of the distribution function. Since the conditional normality assumption is not reasonable here, the B2PSP estimator is biased and its 95% CI has poor confidence coverage. 


Figure 4 Scatter plot of the broadacre farm data with the filled circles representing a pps sample (horizontal axis: farm area in hectares; vertical axis: total income) 
Table 4 
Empirical bias, root mean squared errors and average width of 95% CI (rescaled by powers of 10), and non-coverage rate of 95% CI × 10^3 of θ(α) for α = 0.1, 0.25, 0.5, 0.75, and 0.9: The broadacre farm data 

               0.1     0.25    0.5     0.75    0.9 
Empirical bias 
Inverse-CDF    8       14      10      -22     -60 
B2PSP          -110    -125    -63     -12     88 
SW             20      -19     -17     -21     -61 
Empirical RMSE 
Inverse-CDF    117     [?]     108     164     256 
B2PSP          113     141     124     140     206 
SW             132     173     167     226     350 
Average width of 95% CI 
Inverse-CDF    402     443     501     697     906 
B2PSP          170     [?]     539     726     964 
SW             285     468     615     864     1,589 
Non-coverage rate of 95% CI 
Inverse-CDF    96      53      26      52      90 
B2PSP          670     258     42      8       [?] 
SW             220     1       68      42      44 


4. Discussion 


Sample-weighted estimators for finite population quantiles are widely used in survey practice. Although the sample-weighted estimators with Woodruff’s confidence intervals are easy to compute and can provide valid large-sample inferences, they may be inefficient and confidence coverage can be poor in small-to-moderate-sized samples. Model-based estimators can improve the efficiency of the estimates when the model is correctly specified, but lead to biased estimates when the model is misspecified. To achieve a balance between robustness and efficiency, we considered spline-model-based estimators. For the quantile estimation of a continuous survey variable, we can either estimate the model-based distribution functions and invert them to obtain quantiles, or model the survey outcome on the inclusion probabilities directly. In this paper, we proposed two Bayesian spline-model-based quantile estimators. The first method is the Bayesian inverse-CDF estimator, obtained by inverting the spline-model-based estimates of distribution functions. The second method is the B2PSP estimator, estimated by assuming a normal distribution for the continuous survey outcome, with the mean function and the variance function both modeled using splines. 

The simulations suggest that the two Bayesian spline-model-based estimators outperform the sample-weighted estimator, the design-based ratio and difference estimators, as well as the CD model-based estimator when its assumed model is incorrect. Both new methods yield smaller root mean squared errors whether there is no association, a linear association, or a nonlinear association between the survey outcome and the inclusion probability. In some scenarios, the improvement in efficiency using the two Bayesian methods is substantial. When the normality assumption of the survey outcome given the inclusion probabilities is true, the B2PSP estimator has smaller RMSE and a shorter credible interval than the inverse-CDF approach. Moreover, the two Bayesian model-based estimators are robust to misspecification of both the mean and variance functions. In contrast, the CD model-based estimator is biased and inefficient when either the mean function or the variance function is misspecified. Finally, the Bayesian model-based methods have the advantage of easier calculation of the 95% CI and inference based on the posterior distributions of parameters. This is appealing, because variance estimation for the alternative design-based estimators can be complicated. Woodruff’s variance estimation method for the sample-weighted estimator performs well when a large fraction of the data is selected from the finite population, even in the moderate to extreme tail regions of the distribution function. However, when data from the population are sparse, Woodruff’s method tends to give below-nominal confidence coverage, whereas both Bayesian methods have closer to nominal confidence coverage. 

All three design-based estimators have overall empirical bias comparable to that of the two Bayesian spline-model-based estimators. However, there is a linear trend in the variation of bias for the sample-weighted estimator as the sample mean of inclusion probabilities increases. When there is no association between the survey outcome and the inclusion probability, the ratio and difference estimators have relatively larger bias and RMSE than the sample-weighted estimator. However, in some simulation scenarios, the ratio and difference estimators achieve smaller RMSE than the sample-weighted estimator. The comparison between the conventional sample-weighted estimator and the smooth sample-weighted estimator suggests that fitting a smooth cubic curve to the sample-weighted CDF can improve the efficiency, but the smooth sample-weighted estimator still has larger RMSE than the Bayesian inverse-CDF estimator. 

For normally distributed data, we recommend the use of 
the B2PSP estimator over the other estimators, because of 
smaller bias, smaller RMSE, and better confidence coverage 
with shorter interval length. The B2PSP estimator and its 
95% posterior probability interval are easy to obtain using 
the algorithm proposed by Crainiceanu et al. (2007), which 
also has the advantage of relatively short computation time. 

The B2PSP estimator is potentially biased when the 
conditional normal assumption does not hold. One possi- 
bility here is to transform the survey outcome to make the 
conditional normality assumption more reasonable. The 
B2PSP estimator can be applied to the transformed data, and 
the draws from the posterior distributions of the non- 
sampled units are transformed back to the original scale 
before estimating the quantiles of interest. 

In our simulations with non-normal data, the inverse- 
CDF Bayesian approach was still more efficient than the 
sample-weighted estimator. Improvement in the confidence 
coverage was restricted to situations where the sample size 
is small, with Woodruff’s CI method performing well when 
the large sample assumption holds. Thus, for non-normal 
data where there is no clear transformation to improve 
normality, we do not recommend the inverse-CDF Bayesian 
approach when the sample size is large. Given the good 
properties of the B2PSP estimator in the normal setting, one 
extension for future work is to relax the normality 
assumption in our proposed approaches. 

We use the probability of inclusion as the auxiliary 
variable here. When there is only one relevant auxiliary 
variable, it does not matter whether the inclusion probability 
or the auxiliary variable is modeled. However, if there is 
more than one relevant auxiliary variable, the inclusion 
probability is the key auxiliary variable that needs to be 
modeled correctly, since misspecification of the model 
relating the survey outcome to the inclusion probability 
leads to bias. When other auxiliary variables are observed 
for all the units in the finite population, both of our Bayesian 
estimators can be easily extended to include additional 
auxiliary covariates by adding linear terms for these vari- 
ables in the corresponding penalized spline model. 
One reviewer suggested an alternative weighted Dirichlet 
approach, which is simple to calculate but does not utilize 
the known auxiliary variables in the non-sampled units. 
Another possibility is to re-define the CD estimator by using 
the spline model we have used to define the B2PSP. Speci- 
fically, instead of assuming a regression model through the 
origin, a spline model is fitted to the first and second order 
moments of the conditional distribution of survey outcome 
given the inclusion probability. The spline-based CD 
estimator should perform similarly to the B2PSP estimator, 
and its variance can be estimated using resampling methods. 

In the official statistics context, the methods in this article 
illustrate the potential benefits of a paradigm shift from 
design-based methods towards Bayesian modeling that is 
geared to yielding inferences with good frequentist 
properties. Design-based statistical colleagues raise two 
principal objections to this viewpoint. 

First, the idea of an overtly model-based - even worse, 
Bayesian - approach to probability surveys is not well 
received, although our emphasis here is on Bayesian 
methods with good randomization properties. We believe 
that classical design-based methods do not provide the 
comprehensive approach needed for the complex problems 
that increasingly arise in official statistics. Judicious choices 
of well-calibrated models are needed to tackle such 
problems. Attention to design features and objective priors 
can yield Bayesian inferences that avoid subjectivity, and 
modeling assumptions are explicit, and hence capable of 
criticism and refinement. See Little (2004, 2012) for more 
discussion of these points. 

The second objection is that Bayesian methods are too complex computationally for the official statistics world, where a large number of routine statistics need to be computed correctly and created in a timely fashion. It is true that current Bayesian computation may seem forbidding to statisticians familiar with simple weighted statistics and replicate variance methods. Sedransk (2008), in an article strongly supportive of Bayesian approaches, points to the practical computational challenges as an inhibiting feature. We agree that work remains to meet this objection, but we do not view it as insuperable. Research on Bayesian computation methods has exploded in recent decades, as have our computational capabilities. Bayesian models have been fitted to very large and complex problems, in some cases much more complex than those typically faced in the official statistics world. 
Multiple imputation with census data 


Satkartar K. Kinney 


Abstract 


A benefit of multiple imputation is that it allows users to make valid inferences using standard methods with simple 
combining rules. Existing combining rules for multivariate hypothesis tests fail when the sampling error is zero. This paper 
proposes modified tests for use with finite population analyses of multiply imputed census data for the applications of 
disclosure limitation and missing data and evaluates their frequentist properties through simulation. 


Key Words: Finite Populations; Missing data; Significance testing; Synthetic data. 


1. Introduction 


Multiple imputation was first proposed for handling non- 
response in large complex surveys (Rubin 1987). Several 
other uses for multiple imputation have since been pro- 
posed, including statistical disclosure limitation and mea- 
surement error. An appeal of multiple imputation is that 
standard methods can be applied to each imputed dataset 
and then simple combining rules applied, which vary be- 
tween applications. See Reiter and Raghunathan (2007) for 
a detailed overview of the different rules and applications. 
Existing multiple imputation combining rules were devel- 
oped for use with random samples and superpopulation 
models (Deming and Stephan 1941). In finite population 
analyses of census data, where the sampling variance is 
zero, the combining rules for univariate estimands can still 
be applied as a special case; however, hypothesis tests for 
multivariate estimands break down. 

Motivated by the use of multiple imputation to generate 
partially synthetic data (Rubin 1993; Little 1993) for the 
U.S. Census Bureau’s Longitudinal Business Database 
(Kinney, Reiter, Reznek, Miranda, Jarmin and Abowd 
2011), an economic census, this paper derives a multivariate 
test for finite populations for use with partially synthetic 
data and extends it to the application of missing data. 
Extensions to other multiple imputation applications are 
expected to be straightforward. 

The remainder of this paper is organized as follows. 
Section 2 describes the case of partially synthetic data and 
Section 3 presents the extension to missing data. Simula- 
tions in Section 4 evaluate the combining rules for both the 
missing data and partially synthetic data cases. 


2. Partially synthetic data 


Partially synthetic datasets are constructed by replacing selected values in the confidential data with $m$ independent draws from their posterior predictive distribution. For a finite population of size $N$, let $Z_l = 1$, $l = 1, \ldots, N$, indicate that unit $l$ has been selected to have any observed values replaced with imputations. Imputations should only be made from the posterior predictive distribution of those units with $Z_l = 1$. For simplicity, in this paper, we assume $Z_l = 1$, $l = 1, \ldots, N$. Let $Y$ be the matrix of confidential variables that will be replaced with imputations and $X$ the matrix of variables that will not be replaced. Let $D_{cen} = (X, Y)$ represent a census of all $N$ units containing confidential data and assume that all units are fully observed, i.e., no missing values are present. Let $Y_{rep}^{(i)}$, $i = 1, \ldots, m$, be the $i$-th imputation of $Y$, and let $D_{syn}^{(i)} = (X, Y_{rep}^{(i)})$. The set $D_{syn} = \{D_{syn}^{(i)}, i = 1, \ldots, m\}$ is what is released to the public. 

Any proper imputation procedure from the broad literature on multiple imputation may be used to generate $D_{syn}$ from $D_{cen}$. The finite population methods proposed here can be used regardless of whether a finite population was assumed in the generation of $D_{syn}$. Under a finite population assumption, since the data are a fully observed census, the imputation model parameters would be considered known and fixed. See Reiter and Kinney (2012) for an illustration of how valid inferences are obtained from partially synthetic random samples generated with both fixed and random imputation model parameters. Simulations (not shown) confirm the same is true in the finite population case. 

An analyst with access to $D_{syn}$ but not $D_{cen}$ can obtain valid inferences for a scalar or vector estimand $Q$ using the following quantities: 

\[ \bar{Q}_m = \frac{1}{m} \sum_{i=1}^{m} Q^{(i)} \tag{2.1} \]

\[ \bar{U}_m = \frac{1}{m} \sum_{i=1}^{m} U^{(i)} \tag{2.2} \]

\[ B_m = \frac{1}{m-1} \sum_{i=1}^{m} (Q^{(i)} - \bar{Q}_m)(Q^{(i)} - \bar{Q}_m)' \tag{2.3} \]

where $Q^{(i)}$, $i = 1, \ldots, m$, is the point estimate of $Q$ obtained from $D_{syn}^{(i)}$, $U^{(i)}$ is the estimated variance of $Q^{(i)}$, and $B_m$ is the sample variance of the $Q^{(i)}$, $i = 1, \ldots, m$. 
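
As an illustration of the three combining quantities, the following Python sketch computes the average point estimate, the average within-imputation variance and the between-imputation covariance matrix from plain lists. The function name `combine` and the toy inputs are hypothetical; in a census analysis the within-imputation variances would all be zero, as in the example.

```python
def combine(point_estimates, variance_estimates):
    """Return (q_bar, u_bar, b_m) from m imputed-data analyses.

    point_estimates: list of m vectors (lists of length k).
    variance_estimates: list of m k-by-k matrices (lists of lists)."""
    m = len(point_estimates)
    k = len(point_estimates[0])
    if m < 2:
        raise ValueError("at least two imputations are needed")
    # Average of the m point estimates.
    q_bar = [sum(q[a] for q in point_estimates) / m for a in range(k)]
    # Average of the m estimated variance matrices.
    u_bar = [[sum(u[a][b] for u in variance_estimates) / m for b in range(k)]
             for a in range(k)]
    # Sample covariance matrix of the m point estimates (divisor m - 1).
    b_m = [[sum((q[a] - q_bar[a]) * (q[b] - q_bar[b]) for q in point_estimates)
            / (m - 1) for b in range(k)] for a in range(k)]
    return q_bar, u_bar, b_m


# Toy example with m = 3 imputations and k = 2 components; the within-imputation
# variances are zero, as they would be for a census.
toy_q = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
toy_u = [[[0.0, 0.0], [0.0, 0.0]] for _ in toy_q]
toy_q_bar, toy_u_bar, toy_b_m = combine(toy_q, toy_u)
```

Nested lists keep the sketch dependency-free; with an array library the same quantities are a mean and a sample covariance over the imputation axis.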

When there is no sampling variance, the combining rules for scalar $Q$ derived by Reiter (2003) can be applied as a special case where $\bar{U}_m = 0$. The resulting simplification means the approximations of Reiter (2003) are not needed and the exact posterior under multivariate normal theory is $(Q \mid D_{syn}) \sim N(\bar{Q}_m, B_m/m)$. For a vector $Q$, however, the hypothesis test of Reiter (2005) relies on the assumption that $B_\infty$ is proportional to $\bar{U}_\infty$, i.e., the proportion of information replaced with imputations is the same across components of $Q$, so a different assumption is needed for the case $\bar{U}_m = 0$. 


2.1 Proposed multivariate test 


In this section an alternate test is derived based on the stronger assumption that $B_\infty = r_\infty I$ for a scalar quantity $r_\infty$ and $k$-dimensional identity matrix $I$. In other words, the between-imputation variance is constant across components of $Q$, and $B_\infty$ is assumed to be diagonal. In both the Reiter (2005) test and the proposed test, one averages across variance components so the test is moderately robust to this assumption; however, the randomization validity declines when the estimates of $Q$, $Q^{(i)}$, $i = 1, \ldots, m$, are highly correlated. This is evaluated with simulations in Section 4.3. Comparable tests based on the assumption $B_\infty \propto \bar{U}_\infty$ are known to lose power when the assumption is not met (Li et al. 1991). 

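The proposed test for the hypothesis $H_0: Q = Q_0$ is conducted by referring the test statistic 

\[ S_c = \frac{(Q_0 - \bar{Q}_m)'(Q_0 - \bar{Q}_m)}{k r_c} \]

to an $F_{k, k(m-1)}$ distribution, where $r_c = (1/m)\,\mathrm{tr}(B_m)/k$. Under the assumption $B_\infty = r_\infty I$, the Bayesian $p$-value is given by 

\[ \int P\left( \chi^2_k > (Q_0 - \bar{Q}_m)' (B_\infty/m)^{-1} (Q_0 - \bar{Q}_m) \mid D_{syn}, B_\infty \right) P(B_\infty \mid D_{syn}) \, dB_\infty \tag{2.4} \]

\[ = \int P\left( \chi^2_k > \frac{(Q_0 - \bar{Q}_m)'(Q_0 - \bar{Q}_m)}{r_\infty/m} \mid D_{syn}, r_\infty \right) P(r_\infty \mid D_{syn}) \, dr_\infty. \tag{2.5} \]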
Thus the proportionality assumption reduces the number of variance parameters to be estimated from $k(k+1)/2$ to 1 and allows for the closed-form approximation of the integral in (2.4). As $\bar{U}_m = 0$, the derivation is simplified from Reiter (2005). To complete the integration, we need the distribution of $(r_\infty \mid D_{syn})$. Extending the scalar case in Reiter (2003), the sampling distribution of $Q^{(i)}$, the estimate of $Q$ obtained from $D_{syn}^{(i)}$, is given by $(Q^{(i)} \mid Q_{cen}, B_\infty) \sim N(Q_{cen}, B_\infty)$. Under the proportionality assumption, this becomes $(Q^{(i)} \mid Q_{cen}, r_\infty) \sim N(Q_{cen}, r_\infty I)$. With diffuse priors and standard multivariate normal theory for sample covariance matrices, we obtain 

\[ \frac{(m-1) B_m}{r_\infty} \,\Big|\, D_{syn} \sim \mathrm{Wish}(m-1, I). \]

Taking the trace of each side and integrating over $r_\infty$ in (2.5) yields a Bayesian $p$-value of 

\[ P\left( \frac{\chi^2_k / k}{\chi^2_{k(m-1)} / (k(m-1))} > S_c \,\Big|\, D_{syn} \right) = P\left( F_{k, k(m-1)} > S_c \mid D_{syn} \right). \]

3. Missing data 


The extension to missing data is straightforward. When $\bar{U}_m = 0$, the combining rules (Rubin 1987) for scalar estimands $q$ simplify so that $(q \mid D_{com}) \sim N(\bar{q}_m, (1 + 1/m) B_m)$, where $D_{com}$ is the set of $m$ completed datasets. Similar to Section 2, the tests of Rubin (1987) and Li, Raghunathan and Rubin (1991) for multivariate components rely on the assumption that $B_\infty \propto \bar{U}_\infty$, and so when $\bar{U}_m = 0$ we derive a test under the assumption $B_\infty = r_\infty I$. 

Following derivation procedures similar to those of Section 2.1, the Bayesian $p$-value for testing $H_0: Q = Q_0$ with $k$-dimensional $Q$ is found to be $P(F_{k, k(m-1)} > S_q \mid D_{com})$, where 

\[ S_q = \frac{(Q_0 - \bar{Q}_m)'(Q_0 - \bar{Q}_m)}{k r_q} \]

and $r_q = (1 + 1/m)\,\mathrm{tr}(B_m)/k$. 
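
The missing-data version differs from the synthetic-data version only in the scale factor, (1 + 1/m) tr(B_m)/k in place of (1/m) tr(B_m)/k. The short Python sketch below computes the statistic; the function names and toy inputs are hypothetical. For the special case k = 2 the F tail probability has the simple closed form (1 + S/(m-1))^(-(m-1)), which the sketch uses as a check; other values of k need a general F distribution routine.

```python
def missing_data_statistic(q_bar, b_m, q_null, m):
    """Squared distance between q_null and q_bar divided by k * r_q,
    with r_q = (1 + 1/m) * tr(B_m) / k."""
    k = len(q_bar)
    r_q = (1.0 + 1.0 / m) * sum(b_m[a][a] for a in range(k)) / k
    dist2 = sum((q0 - qb) ** 2 for q0, qb in zip(q_null, q_bar))
    return dist2 / (k * r_q)


def f_tail_two_components(stat, m):
    """Exact P(F_{2, 2(m-1)} > stat); valid only when k = 2."""
    return (1.0 + stat / (m - 1)) ** (-(m - 1))


# Same toy inputs as a k = 2, m = 3 example (hypothetical numbers).
toy_q_bar = [2.0, 4.0]
toy_b_m = [[1.0, 2.0], [2.0, 4.0]]
toy_null = [2.5, 4.5]
toy_stat_q = missing_data_statistic(toy_q_bar, toy_b_m, toy_null, m=3)
toy_p_q = f_tail_two_components(toy_stat_q, m=3)
```

With the same inputs, the missing-data statistic is smaller than the synthetic-data statistic by a factor of m + 1, reflecting the larger posterior variance (1 + 1/m) B_m compared with B_m / m.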


4. Simulation study 


In this section, simple simulation examples illustrate the analytic validity of the proposed combining rules, first for the case of partially synthetic data, and then for the case of missing data. Lastly, the robustness of the tests to the proportionality assumption is evaluated. 

For a population of $N$ = 50,000, $X = (X_1, \ldots, X_{20})$ is drawn from a multivariate normal distribution with mean zero and covariance matrix with 1 in each diagonal element and 0.5 in each off-diagonal element. $Y$ is drawn from a standard normal distribution. For each of 5,000 iterations, a new finite population is generated and $m$ imputations are drawn for $m \in \{2, 5, 10\}$. The proposed hypothesis tests are conducted for $H_0: Q = Q_0$, where $Q$ is the vector of regression coefficients, excluding the intercept, of the regression of $Y$ on $X$ and has dimension $k$, $k \in \{2, 5, 20\}$, and $Q_0$ is the true value of $Q$ determined from the finite population $(X, Y)$. Since $H_0$ is true by design, $H_0$ should be rejected $100\alpha\%$ of the time, for significance level $\alpha = 0.05$. 

Random sampling scenarios are also simulated for comparison purposes. At each iteration, a random sample of size $s$ = 50,000 from an infinite population is generated from the distributions described above, prior to generating the $m$ missing data and synthetic imputations. The same hypothesis $H_0: Q = Q_0$ is tested, where $Q_0$ is the vector of true population values. The combining rules for the hypothesis tests are those of Reiter (2005) in the synthetic data case and Li et al. (1991) and Rubin (1987) in the missing data case. 


4.1 Partially synthetic data imputations 


Let $Y$ be a confidential response variable and $X$ be unreplaced predictors. Then $Y_{rep}$ is generated by taking $m$ independent draws from the posterior predictive distribution $f(Y \mid X)$ assuming a normal linear model, using all available data. 

Table 1 gives the nominal 5% rejection rate for the 
proposed hypothesis test for multicomponent estimands, 
which are seen to be close to the significance level 0.05, and 
close to the random sampling results. From these results it 
appears that the proposed combining rules for population 
data have good frequentist properties. Not shown are the 
rejection rates when the rules from random samples (Reiter 
2005) were applied to finite populations, which were ob- 
served to be quite high, typically 1, in the simulations 
conducted. 


Table 1 
Comparison of nominal 5% rejection rates for tests on 
partially synthetic data 


k=2 k=5 k = 20 
Census data 
m=2 0.048 0.065 0.052 
m=5 0.048 0.061 0.057 
m= 10 0.051 0.067 0.055 
Random sampling 
m=2 0.067 0.062 0.060 
m=5 0.054 0.052 0.050 
m= 10 0.047 0.049 0.049 
4.2 Missing data 


Simulations analogous to the synthetic data simulations were conducted for the missing data case. The missing values of $Y$ are imputed from the posterior predictive distribution $f(Y_{mis} \mid X)$ assuming a normal linear model. Missingness is simulated to be completely at random, with $P(R_l = 1) = 0.3$ for every unit $l$, where $R_l$ is an indicator variable for missingness. 

Table 2 gives the nominal 5% rejection rate for the 
proposed hypothesis test for multicomponent estimands, 
which are seen to be close to 0.05, and to the random sam- 
pling results. From these results it appears that the proposed 
combining rules for population data yield valid inferences. 


Table 2 
Comparison of nominal 5% rejection rates for tests using 
completed census data 


          k = 2    k = 5    k = 20 
Census data 
m = 2     0.052    0.061    0.053 
m = 5     0.048    0.063    0.051 
m = 10    0.048    0.058    0.054 
Random sampling 
m = 2     0.061    0.056    0.053 
m = 5     0.056    0.052    0.052 
m = 10    0.048    0.050    0.051 


4.3 Robustness 


The assumption that $B_\infty = r_\infty I$ is striking at first glance, 
and is unlikely to be exactly true. In this section we evaluate 
the effect of strong correlations across components of Q. 
While moderately strong correlations were present in the 
previous simulations, here we increase the magnitude of the 
between-imputation variance, increasing the magnitude of 
the differences across the diagonal of B as well as the 
distance from zero of the off-diagonal elements of B. 

These simulations are set up as before, for the finite 
population case, with k = 5 and m = 5. The population in 
each iteration is generated in the same way as before, except 
that-we letere= (2 5 h0820, 0.227, OX eX 3, hep 5q) + 
nn ~ N(O,100) and X,=c-X,+8,c€ {0.5, 1, 5} 
and $\varepsilon \sim N(0, 1)$. Increasing values of $c$ yield increasingly higher correlations. The large variance for $\eta$ induces 
larger and more variable values for elements of $B$. 

The results in Table 3 indicate that while the tests have good properties even with moderately high violations of the proportionality assumption, their performance declines with increasingly large correlations. Continuing our assumption that $Q$ represents a vector of regression coefficients, the presence of such large correlations may also be indicative of multicollinearity in the model at hand, so analysts faced with high correlation across $Q$ might take steps to reduce multicollinearity before applying the proposed tests. If variables are of substantially differing magnitude, standardization to rescale them will reduce differences across $Q$. 


Table 3 
Evaluation of tests under assumption violations, k = 5,m = 5 
c=0.5 c=1 c=5 
Synthetic Data 0.059 0.083 0.145 
Missing Data 0.051 0.083 0.136 
Acknowledgements 


A portion of this work was conducted while the author 
was a student at Duke University, supported by NSF grant 
ITR-0427889 and under the guidance of Jerry Reiter, whose 
assistance is greatly appreciated. In addition, the comments 
of anonymous reviewers were quite helpful. 


References 
Deming, W.E., and Stephan, F.F. (1941). On the interpretation of 
censuses as samples. Journal of the American Statistical 
Association, 36, 213, 45-49. 


Kinney, S.K., Reiter, J.P., Reznek, A.P., Miranda, J., Jarmin, R.S. and 
Abowd, J.M. (2011). Toward unrestricted public-use business 
microdata: The Longitudinal Business Database. International 
Statistical Review, 79, 3, 362-384. 


Li, K.H., Raghunathan, T.E. and Rubin, D.B. (1991). Large-sample 
significance levels from multiply-imputed data using moment- 
based statistics and an F reference distribution. Journal of the 
American Statistical Association, 86, 1065-1073. 


Little, R.J.A. (1993). Statistical analysis of masked data. Journal of 
Official Statistics, 9, 407-426. 


Reiter, J.P. (2003). Inference for partially synthetic, public use 
microdata sets. Survey Methodology, 29, 2, 181-188. 


Reiter, J.P. (2005). Significance tests for multi-component estimands 
from multiply-imputed, synthetic microdata. Journal of Statistical 
Planning and Inference, 131, 365-377. 


Reiter, J.P., and Kinney, S.K. (2012). Inferentially valid, partially 
synthetic data: Generating from posterior predictive distributions 
not necessary. Technical report, National Institute of Statistical 
Sciences. 


Reiter, J.P., and Raghunathan, T.E. (2007). The multiple adaptations 
of multiple imputation. Journal of the American Statistical 
Association, 102, 1462-1471. 




NOTICE 


Statistics Canada will be discontinuing its practice to print Survey Methodology. This 
current issue (December 2012 — volume 38 number 2) will be the last version available in print 
form. Please note that the electronic version of Survey Methodology will continue to be 
available free of charge on the Statistics Canada website, www.statcan.gc.ca. 


The next issue is to be published in June 2013 in electronic format and will maintain our 
high standard of content. 


You may subscribe to “My Account” on Statistics Canada website to receive email 
notifications when new issues of the journal are released. 
CORRIGENDUM 


James Chipperfield and John Preston 
“Efficient bootstrap for business surveys”, vol. 33, no. 2 (December 2007), 167-172. 


In Section 4.2 of this paper, under the equation 


Var (Pico) = Var, (E. 6 


s}) 


s]) Ey (Var, eee 


there are five references to the term 


s|). 


To be correct, these five referenced terms should be replaced by 


s|). 


Var, (E. Ee 


E,(Var.[9 


boot 
ANNOUNCEMENTS 
Nominations Sought for the 2014 Waksberg Award 


The journal Survey Methodology has established an annual invited paper series in honour of 
Joseph Waksberg to recognize his contributions to survey methodology. Each year a prominent survey 
statistician is chosen to write a paper that reviews the development and current state of an important 
topic in the field of survey methodology. The paper reflects the mixture of theory and practice that 
characterized Joseph Waksberg’s work. 


The recipient of the Waksberg Award will receive an honorarium and give the 2014 Waksberg 
Invited Address at the Statistics Canada Symposium to be held in the autumn of 2014. The paper will 
be published in a future issue of Survey Methodology (targeted for December 2014). 


The author of the 2014 Waksberg paper will be selected by a four-person committee appointed 
by Survey Methodology and the American Statistical Association. Nomination of individuals to be 
considered as authors or suggestions for topics should be sent before February 28, 2013 to the 
chair of the committee, Steve Heeringa (sheering@isr.umich.edu). 


Previous Waksberg Award honorees and their invited papers are: 


2001 Gad Nathan, “Telesurvey methodologies for household surveys — A review and some 
thoughts for the future?”. Survey Methodology, vol. 27, 1, 7-31. 

2002 Wayne A. Fuller, “Regression estimation for survey samples”. Survey Methodology, vol. 
28, 1, 5-23. 

2003 David Holt, “Methodological issues in the development and use of statistical indicators for 
international comparisons”. Survey Methodology, vol. 29, 1, 5-17. 

2004 Norman M. Bradburn, “Understanding the question-answer process”. Survey Methodology, 
vol. 30, 1, 5-15. 

2005 J.N.K. Rao, “Interplay between sample survey theory and practice: An appraisal”. Survey 
Methodology, vol. 31, 2, 117-138. 

2006 Alastair Scott, “Population-based case control studies”. Survey Methodology, vol. 32, 2, 
123-132. 

2007 Carl-Erik Särndal, “The calibration approach in survey theory and practice”. Survey 
Methodology, vol. 33, 2, 99-119. 

2008 Mary E. Thompson, “International surveys: Motives and methodologies”. Survey 
Methodology, vol. 34, 2, 131-141. 

2009 Graham Kalton, “Methods for oversampling rare subpopulations in social surveys”. Survey 
Methodology, vol. 35, 2, 125-141. 

2010 Ivan P. Fellegi, “The organisation of statistical methodology and methodological research in 
national statistical offices”. Survey Methodology, vol. 36, 2, 123-130. 

2011 Danny Pfeffermann, “Modelling of complex survey data: Why model? Why is it a 
problem? How can we approach it?”. Survey Methodology, vol. 37, 2, 115-136. 

2012 Lars Lyberg, “Survey Quality”. Survey Methodology, vol. 38, 2, 107-130. 

2013 Ken Brewer, Manuscript topic under consideration. 
Members of the Waksberg Paper Selection Committee (2012-2013) 


Steve Heeringa, University of Michigan (Chair) 
Cynthia Clark, USDA 

Louis-Paul Rivest, Université de Laval 

J.N.K. Rao, Carleton University 


Past Chairs: 


Graham Kalton (1999 - 2001) 
Chris Skinner (2001 - 2002) 
David A. Binder (2002 - 2003) 

J. Michael Brick (2003 - 2004) 
David R. Bellhouse (2004 - 2005) 
Gordon Brackstone (2005 - 2006) 
Sharon Lohr (2006 - 2007) 
Robert Groves (2007 - 2008) 
Leyla Mohadjer (2008 - 2009) 
Daniel Kasprzyk (2009 - 2010) 
Elizabeth A. Martin (2010 - 2011) 
Mary E. Thompson (2011 - 2012) 
Layout 


Documents should be typed entirely double spaced with margins of at least 1½ inches on all sides. 

The documents should be divided into numbered sections with suitable verbal titles. 

The name (fully spelled out) and address of each author should be given as a footnote on the first page of the 
manuscript. 

Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise, except for functional symbols such as “exp(·)” 
and “log(·)”, etc. 

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are 
to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters (e.g., w, ω; o, O, 0; l, 1). 

Italics are used for emphasis. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles that are as self-explanatory 
as possible, at the bottom for figures and at the top for tables. 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, page 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 


Short Notes 


Documents submitted for the short notes section must have a maximum of 3,000 words. 